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ABSTRACT 



We study three fundamental problems of Linear Algebra, lying in the heart of var- 
ious Machine Learning applications, namely: (i) Low-rank Column-based Matrix 
Approximation, (ii) Coreset Construction in Least-Squares Regression, and (iii) Fea- 
ture Selection in fc-means Clustering. A high level description of these problems is 
as follows: given a matrix A and an integer r, what are the r most "important" 
columns (or rows) in A? A more detailed description is given momentarily. 

1. Low-rank Column-based Matrix Approximation. We are given a 
matrix A and a target rank k. The goal is to select a subset of columns of A and, by 
using only these columns, compute a rank k approximation to A that is as good as 
the rank k approximation that would have been obtained by using all the columns. 

2. Coreset Construction in Least-Squares Regression. We are given 
a matrix A and a vector b. Consider the (over-constrained) least-squares problem 
of minimizing ||Ax — b||2, over all vectors x G "D. The domain V represents the 
constraints on the solution and can be arbitrary. The goal is to select a subset of 
the rows of A and b and, by using only these rows, find a solution vector that is as 
good as the solution vector that would have been obtained by using all the rows. 

3. Feature Selection in K-means Clustering. We are given a set of points 
described with respect to a large number of features. The goal is to select a subset 
of the features and, by using only this subset, obtain a ^-partition of the points that 
is as good as the partition that would have been obtained by using all the features. 

We present novel algorithms for all three problems mentioned above. Our 
results can be viewed as follow-up research to a line of work known as "Matrix Sam- 
pling Algorithms" . Frieze et al [59] presented the first such algorithm for the Low- 
rank Matrix Approximation problem. Since then, such algorithms have been de- 
veloped for several other problems, e.g. Regression [47], Graph Sparsification [131], 
and Linear Equation Solving [128]. Our contributions to this line of research are: 
(i) improved algorithms for Low-rank Matrix Approximation and Regression (ii) 
algorithms for a new problem domain (/T- means Clustering). 
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CHAPTER 1 
INTRODUCTION 

We study several topics in the area of "Matrix Sampling Algorithms". Chapter 3 
gives a comprehensive overview of this area; here, we summarize the main idea. 
Let the matrix A G M™^" be the input to a linear algebraic or machine learning 
problem. Assume that there is an Algorithm that solves this problem exactly in 
0{f{m,n)). Consider the following question. 

Does there exist a small subset of columns (or rows, or elements) in A 
such that running this Algorithm on this subset gives an approximate 
solution to the problem in 0{g{m,n)) time with g{m,n) = o{f{m,n))7 

A positive answer to this question implies the following approach to solve the prob- 
lem: first, by using a matrix sampling algorithm, select a few columns from A; then, 
compute an approximate solution to the problem by running the Algorithm on the 
selected columns. The challenge is to select a subset of columns from A such that 
the solution obtained by running the Algorithm on this subset is as good as the so- 
lution that would have been obtained by running the Algorithm on A. The focus of 
this dissertation is exactly on developing such "good" matrix sampling algorithms. 
We do so for three problems: 

1. Low-rank Matrix Approximation. 

2. Least-Squares Regression. 

3. i^-means Clustering. 

Prior work has offered good matrix sampling algorithms for several other problems, 
including Low-rank Matrix Approximation and Regression, e.g.: (i) Matrix Multi- 
plication [41], (ii) Graph Sparsification [130], and (iii) Linear Equation Solving [129]. 
The - high level - contributions of this thesis are: (i) improved algorithms for Low- 
rank Matrix Approximation and Regression (ii) novel algorithms for a new problem 
domain (/c-means Clustering). 

1 
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Our motivation to study the above problems is two- fold. First, by selecting 
the most important columns from the input matrix, we quickly reveal the most im- 
portant and meaningful information in it. Consider, for example, a matrix describing 
stock prices with the rows corresponding to dates and the columns corresponding 
to stocks. A small subset of columns that reconstructs this matrix corresponds to 
a small subset of stocks that "reconstructs" the whole portfolio. Identifying the 
"dominant" stocks in a portfolio is certainly a (possibly multimillion) worth doing 
task. Second, by selecting a small portion of the input data and solving a smaller 
problem, we are able to improve the computational efficiency of standard algorithms. 
For example, continuing on the theme of the above discussion, solving a regression 
problem with a small number of rows from A and b, as opposed to the (possibly 
severely overconstrained) problem involving A, b would make statistical arbitrage 
algorithms able to deliver solutions much faster, which is important, for example, in 
high-frequency trading strategies. On top of the improved computational efficiency, 
since the smaller problem is often more robust to noise and outliers, these strategies 
would be attractive from the risk minimization point of view. 

The highlights of our contributions for the three problems we mentioned 
above are as follows: 1) We offer fast, accurate, deterministic algorithms for column- 
based low-rank matrix approximations. We achieve computational efficiency by 
introducing a novel framework in Section 2.2 where one is able to work with ap- 
proximate SVD factorizations and select columns from matrices. Previous work 
for column-based low-rank matrix approximations uses the exact SVD, which is 
expensive. We achieve near-optimal and deterministic algorithms by generalizing 
a recent important result for decompositions of the identity [10] in Sections 3.1.7 
and 3.1.8. Previous work offers algorithms that are not optimal and are typically 
randomized. 2) We present the first deterministic algorithm for coreset construction 
in least-squares regression with arbitrary constraints. We achieve that by putting 
the result of [10] in the linear regression setting. Previous work on this topic offers 
randomized algorithms and less accurate bounds than ours. 3) We present the first 
provably accurate feature selection algorithm for /c-means clustering. We achieve 
that by looking at this problem from a linear algebraic point of view. 
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1.1 Structure 

Sections 1.2, 1.3, and 1.4 of this chapter introduce the three problems that we 
study in detail in the present dissertation. For each problem, we give motivating 
examples, a brief review of prior work, and a resume of our results. We conclude 
this chapter in Section 1.5 by describing the main idea behind the algorithms of this 
dissertation and the proof techniques of our results. Chapter 2 introduces the no- 
tation and provides the necessary background from Linear Algebra and Probability 
Theory. Chapter 3 gives a comprehensive overview of existing work on the topic of 
matrix sampling algorithms. Our results on low-rank matrix approximation, least- 
squares regression, and /c-means clustering are given in detail in Chapters 4, 5, and 
6, respectively. Finally, we discuss directions for future research in Chapter 7. 

1.2 Low-rank Column-based Matrix Approximation 

Given A G R™^" of rank p and k < p, the best rank k approximation to A is 

k 



where (Ti > (r2 > ■ ■ ■ > cr^ > are the top k singular values of A, with associated left 
and right singular vectors Uj G M'" and Vj G M", respectively. The singular values 
and singular vectors of A can be computed via the Singular Value Decomposition 
(SVD) of A in deterministic 0{'mnmin{m,n}) time. There is considerable interest 
(e.g. [25, 59, 39, 35, 97, 133, 34, 73]) in determining a minimum set of r <^ columns 
of A which is approximately as good as A^ at reconstructing A. For example, [35] 
provides a connection between a "good" subset of columns of A and the projective 
clustering problem; [39] efficiently computes low-rank approximations by using a 
subset of "important" columns of A; [133] uses carefully selected "informative" 
columns and rows of A to interpret large scale datasets; [25] shows that a small 
"linearly independent" set of columns of A are more robust to noise in least-squares 
regression. 




i=l 
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Problem Setup. Fix A G M™^'" of rank p, integer k < p, and oversampling 
parameter k < r < n. The goal is to construct C G M'"^'' consisting of r columns of 
A. A is the target matrix for the approximation, k is the target rank, and r is the 
number of columns to be selected. We are interested in the reconstruction errors: 

IIA-CC+AII5 and ||A-n^^fc(A)||5, 

for ^ = 2,F. The former is the reconstruction error for A using the columns in C; the 
latter is the error from the best (under the appropriate norm) rank k reconstruction 
of A within the column space of C. For fixed A, k, and r, we would like these errors 
to be as close to ||A — Afc||g as possible. Note that ||A — CC^A||^ < ||A — 11^ ^(A)||^; 
so, the way we will present our results is the following: 

||A-4,fc(A)lk<«l|A-Afc||^. 

a is the approximation factor of the corresponding algorithm; the goal is to design 
algorithms that select r columns from A and offer "small" a. 

Prior Work. If r = A;, prior work provides almost near-optimal algorithms. 
"Near-optimal" means that the factors a offered by the corresponding algorithms 
are - asymptotically - the best possible. So, there is no room to improve on these 
results, modulo running time. We will indeed present fast algorithms for the r = k 
case that are almost as accurate as the best existing ones (see Theorem 37 for ^ = 2, 
Theorem 38 for ^ = F, and Theorem 39 for ^ = 2, F). 

For general r > k, not much is known in existing literature. For spectral norm, 
we are not familiar with any technique addressing this problem; for Frobenius norm, 
existing algorithms are not optimal and work only if r = Q{k\og{k)) [49, 39]. For 
example, [49] describes a 0(mnmin{m, n} + rlog(r)) time algorithm that, with 
constant probability, guarantees: 

l|A - uUmF < ^1 + 0(^^)11 A - A.ll,. 
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Results. We present the first polynomial-time near-optimal algorithms for arbi- 
trary r > k for both ^ = 2,F. To be precise, for ,^ = F, we present a "near-optimal" 
algorithm only for r > lO/c (see Theorem 36 in Section 4.2); for arbitrary r > k our 
result is optimal up to an additive factor 1 (see Theorem 34 in Section 4.2). We 
note that, for ^ = 2 and r > k, the best possible value for a is a = \f^.\ for ^ = F 



and r > k, the best possible approximation is d = Jl + f . We break our results 
into three categories. 

1. For r > k and ^ = 2: (i) Theorem 32 in Section 4.1 presents a deterministic 
algorithm that runs in O (^mnmm{m,n} + rn (A;^ + (p — ^)^)) achieves error: 

l|A-n^,,(A)||2<0(^y^) ||A-A,|h; 

(ii) Theorem 33 in Section 4.1 presents a considerably faster randomized algorithm 
that runs in O {mnklog {k~^ min{m, n}) + nrk'^) and achieves, in expectation, error: 

E 0|A - nlMM < O (y^) ||A - A,h. 

2. For r > k and ^ = F: (i) Theorem 34 in Section 4.2 presents a deterministic 
algorithm that runs in O (mnmin{m,n} -|- nrk'^) and achieves error: 



||A - WcA^)\\f <V2 + {k/r)\\A - A^H^; 

(ii) Theorem 35 in Section 4.2 presents a considerably faster randomized algorithm 
that runs in O {mnk + nrk"^) and achieves, in expectation, error: 



E [||A - n^,fc(A)||^] < v/3 + 0(A;/r)||A - A,||^. 

3. Finally, for r > lO/c and C, = F, Theorem 36 in Section 4.2 presents a 
randomized algorithm that runs in O {mnk + nk^ + nlog{r)) and achieves, in ex- 
pectation, error: 

E [||A - n^,fc(A)||^] < ^l + 0{k/r)\\A - A,||^. 
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1.3 Coreset Construction in Least-Squares Regression 

Linear regression is one of the beloved techniques in the statistical analysis of 
data, since it provides a powerful tool for information extraction [123]. Research in 
this area ranges from numerical solutions of regression problems [12] to robustness 
of the prediction error to noise (e.g. using feature selection methods [70]). We don't 
address neither of these issues here; rather, we focus on constructing coresets for 
constrained least-squares regression. A coreset is a subset of the data that con- 
tains essentially as much information (when viewed through the lens of the learning 
model) as the original data. For example, with support vector classification, the 
support vectors are a coreset [31]. A coreset contains the meaningful or important 
information and provides a good summary of the data. If such a coreset can be found 
quickly, one could obtain good approximate solutions by solving a (much) smaller re- 
gression problem. When the constraints are complex (e.g. non-convex constraints), 
solving a smaller problem could be a significant saving [60]. (See section 3.3 for 
further motivation on least-squares problems with constraints.) 

Problem Setup. Assume m data points (zi, ?/i), . . . , (z^, y^); Zj G M" are fea- 
tures and i/j G M are targets (responses). Typically m ^ n. The linear regression 
problem asks to determine a vector Xopt G P C M" that minimizes 

n 

^(x) = $^(z^x-^/,)^ 

i=l 

over X G so, £{^opt) < ^(x)) Vx G T). The domain T) represents the constraints 
on the solution and can be arbitrary. A coreset of size r < m is a subset of the data, 
(zj^,?/jj, . . . , (zj^,?/j^). The coreset regression problem considers the squared error. 

Suppose that £ is minimized at Xopt, so £{^opt) < ^^(x), Vx G V; the goal is to 
construct a coreset {zi^.tji^), . . . , {zi^,yi^) such that Xop^ is nearly as good as Xopj. 
For fixed A, b, e > 0, and r being as small as possible, the goal is to find ^opt with 



^opt J • 
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Below, we switch to a more convenient matrix formulation of the problem. 
Let A G M*"^" (m ^ n) be the data matrix whose rows are the data points zj, so 
Aij = Zj[j]; and, b G is the target vector, so fej = Hi. Also, i^(x) = ||Ax — bUl 
and 

Xopi = argmin || Ax — h\\l. 

A coreset of size r < m is a subset C G W^'"" of the rows of A and the corresponding 
elements G W of b (possibly rescaled); so, S{x.) = ||Cx — b^lg and 

±opt = argmin ||Cx - bc||2. 

A coreset C, be is (1 + e)-approximate if the corresponding vector St^pt satisfies 

IIAxopt - h\\l < (1 + e)\\A^opt - Hi 

Prior Work. The best (in terms of coreset size) (1 + e)-coreset construction al- 
gorithm is in [47]. It finds a coreset of size r = 0{n\og{n)e~'^), works only for 
unconstrained regression, runs in 0{m'n? + n log(n)e~^ log(?7, log(n)e~^)), and fails 
with some constant probability. There are also several random-projection type al- 
gorithms that construct "coresets" for unconstrained regression [122, 48, 119, 102]. 
Here, the rows in C are linear combinations of the rows of A (similarly for be and 
b). The best of these methods (in terms of "coreset" size) is in [102] and constructs 
a (1 + e)- "coreset" with r = 0{ne~^) . (Table 3.4 in Section 3.3 gives a summary of 
these results.) 

Results. Our main result is Theorem 42 in Section 5.1, which describes a de- 
terministic O {m'n? + n^e"^) algorithm that constructs a (1 + e)-approximate core- 
set of size r = 0(ne~^). Our result improves upon [47] on three aspects: (i) it 
is deterministic, (ii) the coreset size is 0(log(n)) smaller, and (iii) handles arbi- 
trary constraints. Also, Theorem 43 in Section 5.3 presents a randomized algorithm 
that, with constant probability, constructs a (1 + e)-approximate "coreset" of size 
r = 0(?7, ln(n) log(nm)e~^). Our result improves upon [122, 48, 119, 102] by means 
of providing a "coreset" for regression of arbitrary constraints. 
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1.4 Feature Selection in /c-means Clustering 

Clustering is ubiquitous in science and engineering, with numerous and diverse 
application domains, ranging from bioinformatics and medicine to the social sciences 
and the web [78]. Perhaps the most well-known clustering algorithm is the so-called 
"fc- means" algorithm or Lloyd's method [99], an iterative expectation- maximization 
type approach, which attempts to address the following objective: given a set of 
points in a Euclidean space and the number of clusters k, split the points into k 
clusters so that the total sum of the (squared Euclidean) distances of each point to 
its nearest cluster center is minimized. The good behavior of the Lloyd's method 
([99, 113]), have made fc-means enormously popular in applications [142]. 

In recent years, the high dimensionality of the modern massive datasets has 
provided a considerable challenge to /c-means clustering approaches. First, the curse 
of dimensionality makes algorithms for fc-means clustering very slow, and, second, 
the existence of many irrelevant features may not allow the identification of the 
relevant underlying structure in the data [71]. Practitioners addressed such obstacles 
by introducing feature selection and feature extraction techniques. It is worth noting 
that feature selection selects a small subset of actual features from the data and 
then runs the clustering algorithm only on the selected features, whereas feature 
extraction constructs a small set of artificial features and then runs the clustering 
algorithm on the constructed features. 

Problem Setup. Consider m points V = {pi,P2i ■■■,Pm} ^ 1^"; and integer k 
denoting the number of clusters. The objective of /c-means is to find a /c-partition 
of V such that points that are "close" to each other belong to the same cluster and 
points that are "far" from each other belong to different clusters. A /c-partition of 
P is a collection S = {iSi, 1S2, iSfc} of k non-empty pairwise disjoint sets which 
covers V. Let Sj = \Sj\ be the size of Sj. For each set Sj, let Hj G be its centroid 
(the mean point): = i^p.^zs^Vi) / ^j- The /c-means objective function is 

F{V,S) = Y,\\p.-tJi{pi)\\l- 

i=l 
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^J,{pi) G M" is the centroid of the chister to which pi belongs. The goal of /c-means 
is to find the partition 

Sopt = argmin 
s 

The goal of feature selection is to construct r- dimensional points V = {pi,P2, ■■■,Pm} G 
W (r <^ n, and each pi contains a subset of elements of the corresponding pi), such 
that, for Sopt = arg min^ J^(V, S): 

HVjopt) < W + e)T{V,Sopt). 

Here /3 is a small constant, e.g., (3 = 1,2, 3. The parameter e > is given as input 
and one minimizes r to achieve the desired accuracy /3 + e. 

Prior Work. Despite the significance of the problem, as well as the wealth of 
heuristic methods addressing it [70], there exist no provably accurate feature selec- 
tion methods for /c-means clustering. On the other hand, there are two provably 
accurate feature extraction methods. First, a folklore result [85] indicates that one 
can construct r = 0(log(m)e^^) artificial features with Random Projections and, 
with constant probability, get a (1 + e)-approximate clustering. Second, the work 
in [40] argues that one can construct r = k artificial features with the SVD, in 
0{'mnmin{m,n}) time and get a 2-approximation on the clustering quality. 

Results. We present the first provably accurate feature selection algorithm for k- 
means: Theorem 47 in Section 6.1 presents a 0{mnke~^+k \og{k)e~^ \og{k log(A;)e^^)) 
randomized algorithm that, with constant probability, achieves a (3 + e)-error with 
r = 0(/clog(/c)e~^) features. We also describe a random-projection-type feature 
extraction algorithm: Theorem 49 in Section 6.2 presents a 0{mn\e~'^k / \og{n)\) 
algorithm that, with constant probability, achieves a (2 + e)-error with r = 0{ke~'^) 
"features" . We improve the above folklore result by showing that a smaller number 
of dimensions are enough for obtaining an approximate clustering. Finally, Theorem 
50 in Section 6.3 describes a feature extraction algorithm that uses an approximate 
SVD to construct r = k "features" in 0{mnke~^) time such that, with constant 
probability, the error is at most a 2 + e factor from the optimal. 
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1.5 Algorithms and General Methodology 

Algorithmically, to select r columns from A = [ai, a2, a„] G M"*^", we com- 
pute the matrix V/t G M"^^ of the right singular vectors of A and consider the matrix 

= [vi, V2, v„] G M^'^". Here A; <^ m is part of the input of the corresponding 
problem and denotes, for example, the target rank for the approximation (Low-rank 
Matrix Approximation) or the number of clusters (fc-means Clustering). Notice that 

has the same number of columns with A. So, in some sense, selecting columns 
from A and Vj is equivalent, since there is a one-to-one correspondence between 
the columns of these matrices. Actually, the columns of is a "compact" rep- 
resentation of the columns of A in some fc- dimensional space, which is exactly the 
subspace of interest for the corresponding problem. To construct (rescaled) columns 
C = AfiS, it suffices to construct sampling and rescaling matrices To do so, 

we select columns from such that the submatrix V^f2S has columns that are 
as "linearly independent" as possible. Intuitively, this means that we select the 
columns of that capture essentially all the information in V^, and, since 
is a compact representation of A, this corresponds to selecting columns of A that 
capture most of the information in A. Notice that has rank at most k, so there 
are exactly k columns that are linearly independent, but there might be many such 
/c-subsets of which we do not know what is the best. The challenge is to find the 
"best" such subset of r > /c columns and to do so in low-order polynomial time. 
Evaluating the "linearly independence" of a set of columns can be done through 
different quantities, for example, the determinant, the volume of the parallelepiped 
spanned by those columns, or the singular values of the matrix formed by those 
columns. We will mostly do so by measuring the smallest singular value of V^f2S. 
Notice that ^^(V^fiS) = means that the selected submatrix has rank less than 
k; on the other hand, 0"fc(Vjr2S) ^ implies that the selected columns are almost 
as linearly independent as possible. To summarize, our main goal is to construct 
matrices f2, S and guarantee that (jfc(Vjr2S) is as large as possible. The later goal 
can be viewed as the main task behind almost all the column sampling techniques 
of Section 3.1. We should note here that Section 3.1 presents several methods for 
selecting columns from short-fat matrices of orthonormal rows; then, in Chapters 4, 
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5, and 6 we show how to use these elementary methods for three different problems. 
An important issue is that sometimes we want to ensure that both crfc(V^fiS) is 



large and o"i(Vj_^.f2S) is small, i.e. we want to construct f2, S that simultaneously 
select columns from two different matrices. Existing techniques do not provide such 
advantages. To do so, we developed novel sampling techniques for selecting columns 
from two different matrices simultaneously in Sections 3.1.7 and 3.1.8. Another im- 
portant observation is that we developed a novel theory (see Lemma 5) in Section 2.2 
that shows that there is no need to compute exactly the matrix V^, approximations 
suffice. This theory allows us to design fast algorithms without the SVD. 

Proof Techniques. Our proofs rely on matrix perturbation theory and the 
aforementioned spectral properties of the submatrix V^r2S. As an example of the 
proof techniques that we employ here, consider Lemma 6 in Section 2.2 with W = QS 
and ^ = 2 (after some standard properties of matrix norms): 

l|A - Ill,,{A)\\l < ||A - A,\\l + ||(A - A,)||^||Vj„,fiS||^||(VjnS)+||i 

To prove this general bound, we combined some results from matrix perturbation 
theory with the matrix analog of the Pythagorean theorem. This generic bound 
is quite useful: it indicates that for any sampling and rescaling matrices Q, S the 
approximation error is bounded from above by certain terms, so, it tells us that we 
should focus on algorithms that control exactly these terms. Consider, for example, 
constructing a sampling and a rescaling matrix with the technique of Section 3.1.7: 




\\{nnsm<{l-\l-] ; and ||Vj_,fiS||^ < | 1 + 



p — k 



Combine these bounds with our generic equation and take square roots on both 
sides of the resulting equation: 

l|A - n2_,(A)||2 < ||A - A,|h + ||(A - A,)|h 

We just proved one of the main results of this thesis (Theorem 32)! Very similar 
proof techniques are used in the rest of our results. 
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CHAPTER 2 
PRELIMINARIES 



This chapter introduces the notation that we use throughout this document and 
provides the necessary background from Linear Algebra and Probabihty Theory. 

Basic Notation 

We use A, B, . . . to denote matrices; a, b, . . . to denote column vectors. A = 
[ai, . . . , a,J G M*"^" represents a matrix with columns ai, . . . , a„ G M™. I„ is the 
n X n identity matrix; Omxn is the m x n matrix of zeros; 1„ is the n x 1 vector of 
ones; is the standard basis (whose dimensionality will be clear from the context). 
Aij denotes the (i,j)-th element of A. Logarithms are base two. We abbreviate 
"independent identically distributed" to "i.i.d" and "with probability" to "w.p". 

Sampling Matrices 

This whole dissertation is about sampling columns from matrices; here, we in- 
troduce the notation that we will use to describe such a process. Let A = [ai, . . . , a„] 
and C = [aj^, . . . , a^^] be r columns of A. We can equivalently write C = AQ, where 
the sampling matrix isQ = [e^^, . . . , Cj^] and are standard basis vectors in M". Let 
S denote an r x r diagonal reseating matrix with non-zero entries; then, C = AQS 
contains r columns from A rescaled with the corresponding diagonal element of S. 
Notice that Ar2(Ar2)"'~ = Ar2S(Af2S)"'", because rescaling C does not change the 
subspace spanned by its columns. So, in Chapter 4 all the results hold for C = AQ 
as well (we stated the results for C = Af2S). In Chapters 5 and 6 the rescaling can 
not be ignored. 

A permutation matrix is a special case of a sampling matrix where, for some 

permutation tt of [1, . . . , n], LI = [e^^, e^^, . . . , e^^^] G M"^"; i.e. ALI G M"*^" contains 

the columns of A just permuted according to tt. 

^Portions of this chapter previously appeared as: C. Boutsidis, P. Drineas, and M. Magdon- 
Ismail, Near-Optimal Column-Based Matrix Reconstruction, arXiv: 1103.0995, 2011. 
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Matrix norms 

We use the Frobenius and the spectral norm of a matrix: ||A||p' = y^'Xljj 
and ||A||2 = maXx;||x||2=i ||Ax||2, respectively. For any two matrices A and B of 
appropriate dimensions, ||A||2 < ||A||j7 < A/rank(A) || A||2, ||AB||i? < || A||ir||B||2, 
and IIABIIf < ||A||2||B||i?. The latter two properties are stronger versions of the 
standard submultiplicativity property: ||AB||^ < || A||^||B||g. We will refer to these 
two stronger versions as spectral submultiplicativity. The notation ||A||^ indicates 
that an expression holds for both ^ = 2 and C, = F. 

Singular Value Decomposition 

The Singular Value Decomposition (SVD) of the matrix A with rank(A) = p 

is: 

with singular values cri > . . .au > cr^+i > . . . > dp > 0. We will often denote 
(Ji as cTmax and dp as o"min, and will use CTj (A) to denote the i-th singular value of 
A when the matrix is not clear from the context. The matrices G M™^'^ and 
Up-k e M"^^ contain the left singular vectors of A; and, similarly, the matrices 
Vfe G M.^^'' and Vp^k G ]R"^(P"'=) contain the right singular vectors of A. It is 
well-known that A^ = UfeS^Vj G M"*^" minimizes ||A — X||^ over all matrices 
X G R™^" of rank at most k. We use Ap^k G W^^"' to denote the matrix A — A^ = 
Up_fcS,_fcVj„, G M'^x". Also, ||A||^ = VELi^KA) and ||A||2 = a,{A). The 
best rank k approximation to A satisfies ||A — Afc||2 = a"fc+i(A) and ||A — A^Hi? = 

Let B G M"'''" (m < n) and A = BB^ G then, for all i = l,...,m, 

Aj (A) = af (B) denotes the z-th eigenvalue of A. We will also use A^in (A) and 
-^max (A) to denote the smallest and largest eigenvalue of A, respectively. Any matrix 
A that can be written in this form A = BB""" is called a Positive Semidefinite (PSD) 
matrix; clearly, all eigenvalues of a PSD matrix are non-negative. 
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Perturbation Theory, Quadratic Forms, and PSD Matrices 

Lemma 1. Let V G M"^^' with n > k and V^V = Ifc. LetO <€<1. LetW e M"^^ 
with k < r < n. Then, the following six statements are equivalent. 

1. IIV^WW^V- Ifclla < e. 

2. For alii = l,...,k: 1- e < Ai(V^WW^V) < 1 + e. 

3. For alli = l,...,k: l-e< af{V^W) < 1 + e. 

4. For any vector y G R^' (l-e)yTV^Vy < y^V^WW^Vy < (l + e)yTV^Vy. 

5. For any vector y e R'' : (1 - e)||Vy||| < HW^^VylH < (1 + e)||Vy|||. 

6. (1 - e)V^V ^ V^WW^V ^ (1 + e)V'^V. (Partial ordering - see proof.) 

Proof. Assume that the first statement in the Lemma is true. We first recall the 
standard perturbation result for eigenvalues, which says that the eigenvalues of the 
real square symmetric matrices X and X + E satisfy: |Ai(X) — Aj(X + E)| < ||E||2. 

The second statement of the Lemma follows from the first statement and the 
above perturbation result with X = V'^WW^V and E = V^WW^V - 1^. 

The third statement follows from the second statement by using the relation 
Ai(V^WW^V) = afiV^W). 

To prove the fourth statement, we will use a property of the Rayleigh quotient 
of a square symmetric matrix X. For a vector y define i?(X, y) = ^y^y ■ It is well 
known that for all y: Am,in(X) < i?(X, y) < Amax(X). To conclude, use the later 
equation with X = V^WW"'"V along with the second statement of the Lemma. 

The fifth statement follows from the fourth statement by using that for any 
vector X, ||x||2 = x^x. To conclude, use this twice for x = Vy and x = W"'"Vy. 

In the sixth statement, for two matrices X and Y, X ^ Y denotes the fact 
that the matrix Y — X is a Positive Semidefinite (PSD) matrix, i.e. the statement 
says that both V^WW^V - (1 - e)V^V and (1 + e)V^V - V^WW^V are PSD. 
First, recall the definition of a PSD matrix: a square symmetric matrix X is PSD 
if and only if for any vector y: y^Xy > 0. The left inequality follows by the left 
inequality of the fourth statement; similarly for the right. ■ 
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Moore-Penrose Pseudo-inverse 

= VaS^^UJ G k^x™ denotes the Moore-Penrose pseudo-inverse of A G 
j^mxn. (xi^^ is the inverse of Sa), i-e. the unique n x m matrix satisfying all four 
properties: A = AA+A, A+AA+ = A+, (AA+)T = AA+, and (A+A)^ = A+A. 
By the SVD of A and A"^, it is easy to verify that, for all i = 1, = rank(A) = 
rank(A^), (Tj(A^) = l/(Tp_j+i(A). We will also use the following standard result. 

Lemma 2. Let A G M'"''", B G W"^; then, (AB)+ = B+A+ if at least one of the 
three hold: (i) A = I„; (ii) 3^3 = 1^; or, (in) rank(A) = rank(B) = n. 

Matrix Pythagoras Theorem 

Lemma 3. //X, Y G M"^" and XY^ = 0„,xm or X^Y = 0„xn, then 

\\X + Y\\l=\\X\\l+\\Y\\l, 
max{||X||2, ||Y||2} < ||X + Y||2 < ||X||2 + ||Y||^. 

Proof. Since XY^ = 0„xm, (X + Y)(X + Y)^ = XX^ + YY^. For ^ = F, 

\\X + Y\\l = Tr ((X + Y)(X + Y)^) = Tr (XX^ + YY^) = ||X||2, + ||Y||2,. 

Let z be any vector in M"^. For ^ = 2, 

||X + Y||2= max z^(X + Y)(X + Y)^z = max (z^XX^z + z^YY^z) . 

||z|i2=l I|z|i2=l 

The bounds follow from the following two relations: 
max (z^XX^z + z^YY^z) < max z^XX^z + max z^YY'^z = ||X||2 + ||Y||2; 

INI!2=1 ^ ^ ~ ||Z||2 = 1 ||Z||2=1 ' ' 

max (z^XX^z + z^YY^z) > max z^XX^z = ||X||^, 

||z||2=l l|zi|2=l 

since z"'"YY"'"z is non-negative for any vector z. We get the same lower bound with 
II Y II 2 instead, which means we can lower bound with max{||X||2, IIYII2}. The case 
with X^Y = 

nxn can be proven similarly. ■ 
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Projection Matrices 

A square matrix P G M"^" is a projection matrix if = P. For such a 
projection matrix and any matrix X: 

IIPXII5 < IIXII5. 

Also, if P is a projection matrix, then, I„ — P is a projection matrix. So, for any 
matrix X, both XX"*" and I„ — XX"*" are projection matrices. 

Markov's inequality, the Union Bound, and Boosting 

We use E[x] to take the expectation of a random variable x and Pr[(£^] to take 
the probability of a probabilistic event S. 

Markov's inequality can be stated as follows: Let x be a random variable 
taking non-negative values with expectation E [x]. Then, for all t > 0, and with 
probabihty at least 1 — t^^, 

x<t-E[x]. 

We will also use the so-called union bound. Given a set of probabilistic events 
81,82, ■■■ holding with respective probabilities pi,p2, . . . , Pn, the probability that 
all events hold (a.k.a., the probability of the union of those events) is upper bounded 

n 
i=l 

Sometimes we state our non-deterministic results in terms of their expected 
approximation behavior. An application of Markov's inequality gives the result with 
constant probability; then, standard boosting techniques suffice to make the failure 
probability arbitrarily small by repeating the algorithm many times and keeping the 
best result. Using this approach, to make the failure probability 6 arbitrarily small, 
it suffices to repeat the algorithm 0(log(l/5)) times and keep the best result. One 
should be careful though because identifying the best solution might be an expensive 
task. 
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Random Sampling Techniques: Implementation Issues 

We often sample columns from matrices randomly based on a probability dis- 
tribution over the columns. Once this probability distribution is computed, then, 
the sampling process has nothing to do with the matrix and the problem is reduced 
to sampling a subset of indices based on this distribution. Here, we discuss how to 
implement this process in three different scenarios: (i) sampling with replacement, 
(ii) sampling without replacement, and (iii) uniform sampling. In all three cases 
the setting is as follows: we are given indices {1,2,... ,n} (that correspond to the 
columns of a matrix A G M™^"), a probability distribution pi,p2, ■■■,Pn over these 
indices, and a sampling parameter < r < n. We are asked to select r indices. 

Sampling with Replacement. This corresponds to performing r i.i.d trials of 
the following random experiment: throw a biased die with n faces each one occurring 
w.p Pi. One way to implement this process in a computer programming language is 
as follows. First, generate r numbers rji, 772, Vr i-i-d from the normal distribution. 
Then, in j = 1, r rounds find an index i with pi > rjj. This approach needs 0(r + 
nr). This process can be implemented thought more efficiently in 0{n + r log(r)). 

Sampling without Replacement. For each index i = l,...,n, one computes 
Qi = min{l, rpi} and selects this particular index i with probability gj (flip a biased 
coin for each index separately). This can be implemented, for example, by the use 
of a Gaussian random number generator. For a fixed i, we can generate a number 
rji G A/'(0, 1) and if Qi < rji, we select i, otherwise we do not. The overall process needs 
0{n) time, since we need n multiplications for all rpi, n comparisons to compute 
the gi's, 0{n) to compute n i.i.d numbers from the normal distribution, and finally, 
n comparisons to compare these numbers with the g^'s. 

Uniform Sampling with Replacement. Here, for j = 1, .., r i.i.d trials one has 
to implement the following experiment: select the index i from the set {1,2, ...,n} 
with probability pi = ^. We can use a random number generator returning instances 
from the discrete uniform distribution. We only need r i.i.d such numbers to sample 
r indices, which takes 0(r). 
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2.1 Best rank k Approximation nQ^(A) within a Subspace 

Given A G M™^", integer k, and C G M™^*" with r > k, we define the matrix 
n^^(A) G M"^^" as the best approximation to A (under the ^-norm) within the 
column space of C that has rank at most k; so, n^^(A) G M"*^" minimizes the 
residual ||A — A||^, over all A G M"*^" in the column space of C that have rank at 
most k. In general, ;.(A) ng,^(A). We can write n^_fc(A) = CX^, where X^: 

X^ = argmin ||A - C\I/|||. 

*eK''''":rank(*)<fc 

In order to compute (or approximate if exact computation is not obvious) 11^ fc(A) 
given A, C, and k, we will use the following algorithm: 

1: Orthonormalize the columns of C in 0{mr'^) time to construct Q G M™^''. 

2: Compute (Q'^A)^ G W'^ via the SVD in 0{mnr + nr^) time; (Q^A)^ is the 

best rank-Zc approximation of Q"'"A. 
3: Return n^_fc(A) = Q (Q^A)^ G M'"''" in 0{mnk) time. 

Note that though 11^ ^.(A) can depend on ^, our algorithm computes the same 
matrix, independent of ^. The next lemma, which is essentially Lemma 4.3 in [30] 
together with a slight improvment of Theorem 9.3 in [73], proves that this algorithm 
computes 11^^ (A) and a constant factor approximation to 11^ ^.(A). 

Lemma 4. (See Appendix for the proof.) Given A G M™"^", C G M™^'' and k, the 
matrix Q (Q"'"A)^ G M™^" described above (where Q is an orthonormal basis for the 
columns of C) can be computed in O {mnr + (m + n)r^) time and satisfies: 

l|A-Q(Q^A)J|^ = l|A-nS,,(A)|||, 
||A-Q(Q^A)J|^ < 2||A-n^,,(A)||^. 

Remark 1. In the context of the definition of 11^ ^(A), C can be any matrix, not 
necessarily a subset of the columns of A. 

Remark 2. An interesting open question is whether one can compute 11^ ^(A) 
exactly or obtain a better than a 2-approximation as in Lemma 4. 
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2.2 Column-based Matrix Reconstruction through Matrix 
Factorization 

This section presents the fundamental idea underlying all our results regarding 
low-rank column-based matrix approximation (Chapter 4). Lemma 5 below draws 
a connection between matrix factorizations and column-based reconstruction. The 
factorizations that we will see in this Lemma are of the following form 

A = BZ^ + E, 

where B G M™^'^, Z G M'^^'^, E G M"*^", and Z consists of orthonormal columns. 
Lemmas 6, 7, and 8 discuss algorithms to construct such factorizations. The proofs 
of the results of this section are given in the Appendix. Recall that k is the target 
rank for the approximation. 

Lemma 5. Let A = BZ^ + E, with EZ = Omxk and Z^Z = Ifc. Lei W G M"'''' be 
any matrix such that rank(Z'^W) = rank{Z) = k. Let C = AW G M™^''. Then, 



A - CC+A||' < ||A - n^,;,,(A)||' < ||E||| + ||EW(Z^W) 




View C as a dimensionally- reduced or sampled sketch of A; W is the dimension- 
reduction or sampling matrix, for example, W = fiS, for some sampling and rescal- 
ing matrices f2, S, such that C contains (rescaled) columns of A. In words. Lemma 
5 argues that if the matrix W preserves the rank of an approximate factorization 
of the original matrix A, then, the reconstruction of A from C = AW has an error 
that is essentially proportional to the error of the approximate factorization. The 
importance of this lemma is that it indicates an algorithm for matrix reconstruction 
using a subset of the columns of A: first, compute any factorization of the form 
A = BZ""" + E satisfying the assumptions of the lemma; then, compute sampling 
and rescaling matrices W = fiS which satisfy the rank assumption and control the 
error ||EW(Z^W)+||^. 

An immediate corollary of Lemma 5 emerges by considering the SVD of A. 
More specifically, consider the following factorization of A: A = AVfcV^ + (A — A^), 
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where is the matrix of the top k right singular vectors of A. In the parlance of 
Lemma 5, Z = Vfc, B = AV^, E = A — A^, and clearly EZ = Omxk- 

Lemma 6. Let W G M"^^ be a matrix with ran/c(V^W) = k. Let C = AW; then, 

\\A - CC+A||^' < ||A - 4,,(A)||J < ||A - A4I + ||(A - A,)W(VrW)+||^'. 

The above lemma will be useful for designing the deterministic (spectral norm and 
Frobenius norm) column-reconstruction algorithms of Theorems 32 and 34 in Chap- 
ter 4. However, computing the SVD is costly and thus we would like to design a 
factorization of the form A = BZ""" + E that is as good as the SVD, but can be 
computed much faster. The next two lemmas achieve this goal. The proposed al- 
gorithms are extensions of the algorithms presented in [141, 73]. We will use these 
factorizations to design fast column reconstruction algorithms in Theorems 33, 35, 
and 36 in Chapter 4. 

Lemma 7 (Randomized fast spectral norm SVD). Given A G M'"^"- of rank p, a 
target rank 2 < k < p, and < e < 1, there exists an algorithm that computes a 
factorization A = BZ""" + E, with B = AZ, Z"'"Z = 1^, and EZ = Omxk such that 



< (^V2 + e) 



EfllElUl < V2 + e IIA- A, 



fe 2- 



/ logffc ^ min{m,7i} ) \ 

i he proposed algorithm runs m U I mnk — iog(i+e) ] ™" '^^^ ^'^^ 

statement Z = FastSpectralSVD{A, k,e) to denote this procedure. 

Lemma 8 (Randomized fast Frobenius norm SVD). Given A G R'"^" of rank p, 
a target rank 2 < k < p, and < e < 1, there exists an algorithm that computes a 
factorization A = BZ""" + E, with B = AZ, Z'^Z = 1^, and EZ = Omxk such that 

E [\\E\\l] < {1 + e)\\A - A,\\l. 

The proposed algorithm runs in O {mnke"^) time. We will use the statement Z = 
FastFrobeniusSVD{A, k,e) to denote this procedure. 
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Remark. Notice that any /3- approximation to || A — IIq ^..(A) ||| in Lemma 4 implies 
a (v^ + e) -approximation in Lemma 7; in particular a /3 = (1 + e) -approximation 
would imply a relative error approximation in Lemma 7. To our best knowledge, 
that would be the first relative error fast low-rank approximation algorithm with 
respect to the spectral norm. 



CHAPTER 3 
BACKGROUND AND RELATED WORK 

The primary goal of this chapter is to give a comprehensive overview of the broad 
topic of matrix samphng algorithms. We achieve this goal in three steps. 

First, in Section 3.1, we present (randomized and deterministic) techniques 
for selecting columns from matrices. In total, we present ten such techniques. Most 
of these techniques appeared in prior work; for example, the randomized method 
of section 3.1.4 became popular [50, 131] after it introduced in the celebrated work 
of Rudelson and Virshynin [120], and the deterministic technique of section 3.1.5 
corresponds to the seminal work of Gu and Eisenstat on Strong Rank-Revealing 
QR Factorizations [66]. Two of them are novel techniques (see Sections 3.1.7 and 
3.1.8) and two of them are not exactly techniques for column selection, rather they 
construct a small subset of columns that are linear combinations of the columns of 
the input matrix (see Sections 3.1.9 and 3.1.10). We use these ten elementary 
techniques in our algorithms in Chapters 4, 5, and 6. Each of these ten 
elementary techniques will be supported by a Lemma that, in some sense, describes 
the quality of the sampled columns. These lemmas describe the "spectral properties" 
of the submatrices constructed with the corresponding technique. These lemmas lie 
in the heart of the proofs presented in Chapters 4, 5, and 6. 

Second, we give a precice description of prior work directly related to the three 
problems that we study in the present dissertation (see Sections 3.2, 3.3, and 3.4). 

Finally, we attempt to draw the "big picture" of the topic of matrix sampling 
algorithms. We do so by presenting several related problems from this area in 
Section 3.5. 

^Portions of this chapter previously appeared as: C. Boutsidis and P. Drineas, Random Pro- 
jections for the Nonnegative Least Squares Problem, Linear Algebra and its Applications, 431(5- 
7):760-771, 2009, as C. Boutsidis, A. Zouzias, and P. Drineas, Random Projections for fc-means 
Clustering, Advances in Neural Information Processing Systems (NIPS), 2010, as C. Boutsidis, 
M.W. Mahoney, and P. Drineas, An Improved Approximation Algorithm for the Column Subset 
Selection Problem, Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algo- 
rithms (SODA), 2009, and as C. Boutsidis, P. Drineas, and M. Magdon-Ismail, Near-Optimal 
Column-Based Matrix Reconstruction, arXiv:1103.0995, 2011. 
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3.1 Sampling Techniques for Matrices 
3.1.1 Randomized Additive-Error Sampling 

Frieze, Kannan, and Vempala [59] presented the first algorithm for fast column- 
based low-rank matrix approximations. The algorithm of the lemma below is ran- 
domized and offers "additive-error" approximation guarantees, i.e. the approxima- 
tion error ||A — n^^(A)|||, is upper bounded by the "optimal" term ||A — Afc||p 
plus an additive term which depends on ||A|||,. A relative error algorithm would 
replace this term with ||A — Afc|||,, delivering a much more accurate worst-case ap- 
proximation bound. The advantage of the algorithm of the following lemma is that 
its running time is essentially linear on the dimensions of A. 

Lemma 9. Given a matrix A G M'"-^" of rank p, a target rank k < p, and an 
oversampling parameter < r < n, there is a 0{mn+rlog{r)) randomized algorithm 
to construct C E R'^^^- 

E [||A - CC+AIII] < E [||A - Ul,{A)\\l] < ||A - A^l + ^||A|||. 

We will write C = AdditiveSampling{A,r) to denote this randomized algorithm. 

Let aj G M™ denotes the i-th column of A as a column vector. The algorithm 
mentioned in the lemma can be implemented as follows. For i = 1, ...,n compute: 

r** 2 

Now, construct a sampling matrix Q G M"^'' as follows. Initially, Q = 0„xr- Then, 
for every column j = l,...,r of Q, independently, pick an index i from the set 
{1, 2, n} with probability pi and set Qij = 1. Return C = AQ. One needs 0{mn) 
time to compute the sampling probabilities and 0{n + rlog(r)) to choose the r 
columns, in total 0{mn + rlog(r)). Although, AdditiveSampling is not used in 
later chapters, we included it in our discussion since it is the first matrix sampling 
algorithm; also, it serves as a prequel to the presentation of the algorithm in the 
next subsection, which we will use in Theorem 36 in Section 4.2. 
Remark. An interesting open question is whether there exists a deterministic 
algorithm achieving a similar "additive-error" approximation bound. 
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3.1.2 Randomized Adaptive Sampling 

As in [59] , Desphande et al [35] continue on the topic of fast column-based low- 
rank matrix approximations and present an extension of the seminal result of [59]. 
More specifically, they ask and answer in affirmative and constructively the following 
question: given A and an initial "good" subset of columns Ci, is it possible to select 
columns from A and improve the above additive-error algorithm? The following 
lemma, which is Theorem 2.1 of [35], presents an algorithm that replaces the additive 
error term ||A|||, in the bound of Lemma 9 with the term ||A — CiC^A|||,. 

Lemma 10. Given a matrix A G M™^" of rank p, a matrix Ci G W^^"^ consisting of 
r columns of A, a target rank k < p, and an oversampling parameter < s < n — r, 
there is a 0{mrmin{m,r} + mrn + slog(s)) randomized algorithm to construct 

C ^ ]^mx(r+s) . 



We write C = AdaptiveSampling{A, Ci, s) to denote this randomized algorithm. 

The algorithm mentioned in the lemma is similar with the one described in 
the previous section with the only difference being the sampling probabilities which 
now depend on the matrix A — CiC^A instead of the matrix A. More specifically, 
define the residual error matrix B = A — CiC^A G M™^". For i = 1, . . . ,n, let 



where bj is the i-th column of B. Construct a sampling matrix f2 G W^^^ as fol- 
lows. Initially, Q = Onxs- Then, for every column j = of Q, independently, 
pick an index i from the set {l,2,...,n} with probability pi and set Qij = 1. Let 
Ca = Afi G M™''" contains the s sampled columns; then, C = [Ci Cs] G M™x(^+^) 
contains the columns of both Ci and C2, all of which are columns of A. One 
needs 0(mr min{m, r}) to compute , 0{mnr) to construct B, 0{mn) to con- 
struct the Pi's, and 0{n + slog(s)) to sample the additional s columns; in total 
0{mr min{m, r} + mnr + slog(s)). 

Remark. Lemma 9 is a special case of Lemma 10 with Ci being an empty matrix. 



E [||A - CC+A 



,] < E [||A - n^,fc(A)||^] < ||A - AkWl + -||A - CiC+A 



Pi = 
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3.1.3 Randomized Volume Sampling 

Desphande and collaborators [35, 39, 34] introduced a randomized technique 
that samples column submatrices from the input matrix with probabilities that are 
proportional to the volume of the simplex formed by the columns in the submatrix 
and the origin. Notice that this technique can either sample single columns (subma- 
trices with just one column) or sets of multiple columns. Similarly to the techniques 
of the previous two sections, the columns returned by volume sampling approaches 
offer provably accurate column-based low-rank approximations with respect to the 
Frobenius norm. [35, 39, 34] discuss in detail this line of research; here, we give a 
quick summary and highlight the most important results. 

Lemma 11 (Theorem 1.3, [35]). Fix A G M™^" of rank p and target rank k < p. 
Fori = 1,..., [^) , consider all possible matrices Ci, C2, Cj-n-^ G W^^^ consisting 
of k columns of A, and probabilities 

_ det (C.^CQ 

Pi ~ TTTn • 

E,yidet(CjC,) 
In one random trial, pick the matrix Cj with probability pi; then: 

E[||A-C,C+A|||] < (A; + 1)||A-A,,|||. 

Unfortunately, the above lemma does not indicate an efficient algorithm to compute 
the probabilities p^'s; so, from an algorithmic perspective, the result is not useful. 
Propositions 1 and 2 in [39] made progress by quickly approximating these probabil- 
ities at the cost of replacing the approximation factor (k + l) with {k + l)\. Recently, 
Theorem 7 in [34] presented an efficient randomized algorithm for computing these 
probabilities exactly in 0{knm^\og{m)). A derandomization of this randomized 
algorithm led to a deterministic column-sampling algorithm that, again, runs in 
O {knm^ log{m)) and achieves approximation error {k + 1). Finally, by leveraging 
a random projection type result of [101], Theorem 9 in [34] presents an algorithm 
that, for any < e < 1, runs in 0{nmlog{n)k^e~^ +nlog^{n)k'^e~^ log{ke^^ log(n))), 
and achieves approximation E [||A - CC+A|||,] < (1 + e)(A; + 1) || A- Afc|||. Finally, 
recently [68] volume sampling extended to sample any r > k columns. 
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3.1.4 Randomized Subspace Sampling 

Definition 12 (Random Sampling with Replacement [120]). Let X G M"^^ with 
n > k; X.J G M}^^ denotes the i-th row o/ X and < /3 < 1. For i = l,...,n, if 
(3 = 1, then pi = (x^Xj)/||X|||., otherwise compute some pi > /3(x^Xi)/||X|||, with 
Y17=iPi ~ ^ '^^ integer with 1 < r < n. Construct a sampling matrix 

Q G M"^'' and a rescaling matrix S G W^^ as follows. Initially, Vt = 0„,xr o^n-d 
S = Orxr- Then, for every column j = 1, ...,r ofQ, S, independently, pick an index 
i from the set {1,2, ...,n} with probability pi and set Qij = 1 and Sjj = 1 / ^/PiV . To 
denote this 0{nk + rlog(r)) time randomized procedure we will write: 

[fi, S] = SubspaceSampling(X, (3,r). 

It is interesting to consider applying this technique for selecting columns from 
short-fat matrices of orthonormal rows. Drineas et. al [47] are credited for applying 
the method of [120] to such matrices. The term "subspace sampling" is from [47] 
and denotes the fact that the sampled columns "capture" the subspace of interest. 

Lemma 13 (Originally proved in [120]). Let V G M"^'^ with n > k and V'^Y = 
Ifc. Let < f3 < 1, < 6 < 1, and 4:k\n{2k/6)/ (3 < r < n. Let [1], S] = 
SubspaceSampling{V , f3,r) . Then, for all i = 1, ...,k, w.p. atleastl — 6: 

Proof. In Theorem 2 of [100], set S = I and replace e in terms of r, /3, and d. The 
lemma is proved; one should be careful to fit this into our notation. ■ 

Lemma 14 (See Appendix for the proof). For any (3, r, X. E M"^'^, andY G W^^"', 
let [Q, S] = SubspaceSampling{X., 13 , r) ; then, w.p. 1 — 6: ||Yr2S|||. < IHYHI,. 

Remark. Quite recently, Zouzias [143] proved the deterministic analog of Lemma 
13. More specifically. Theorem 11 in [143] proves that, on input V G M"^'^ and 
r > 30A;ln(2A;), there is a deterministic 0{nrk\og^{k)) algorithm that returns G 
S G such that, for all i = l,...,k: 1 - ^30k\n{2k)/r < afCV^nS) < 
1 + ^y30k \n{2k)/r. The notation O(-) hides log(log(l/e)) and \og{\og{k)) factors. 
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3.1.5 Deterministic Sampling with the Strong RRQR 

Gu and Eisenstat [66] give a deterministic algorithm for selecting k < n 
columns from generic m x n matrices. Here, we are only interested in sampling 
k columns from k x n short-fat matrices; so, we will restate the main result of [66] 
as Lemma 15 below to fit our notation. (Lemma 15 is not immediate from [66], so 
we include a proof in the Appendix.) Lemma 16 is a simple corollary of Lemma 15. 
Notice that our lemmas here guarantee the existence of a deterministic technique; 
the actual description of this technique can be found as Algorithm 4 in [66] . 

Lemma 15 (Extension of Algorithm 4 of [66] to short-fat Matrices). Let X G M"-^™ 
(m < n) have rank px, and 1 < k < px- Let / > 1. There is a deterministic 
0{mnk\ogj{n)) algorithm to construct a permutation matrix 11 G M"^'", a matrix 
Qx G IR'"^'" with orthonormal columns, and a matrix R G M™^".- 

A B 

X^n = QR = Qx| I ; Afc G R*^^^ Bfc G M'^^("-'=);Cfc G M("'-'=)^("-'=); 

such that for all i = 1, ...,k, and j = k + 1, m; 

/jf^^], , < ^^(A/c) < a.(XT);a,(XT) < a,_,(C,0 < V/^M^ - A:) + la,(X^ 

Lemma 16. Let X G W^^^ with n > k; there exists a deterministic algorithm that 
runs in 0{nk'^\og{n)) time and constructs a sampling matrix Q G M"^^ such that 

afc(X^fi) > ^^^"^ ^ — and < \\X^h. 

We write VL = RRQRSampling(K, k) to denote such a deterministic procedure. 

Proof. Compute a QR factorization of X with Lemma 15 and let f2 be the first k 
columns of the matrix 11. The bound for a^^XyVt) follows by applying the first 
bound in Lemma 15 with i = k and / = 2. The ^-norm upper bound follows by 
spectral submultiplicativity and ||f^||2 = 1- The running time is from Lemma 15 
with m = k. ■ 
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3.1.6 Deterministic Sampling with the Barrier Method I 
Lemma 17 (Single-set Spectral Sparsification [10]). Let V G W^'^ with V^V = 1^. 
Let r is an integer and assume k < r < n. One can construct (deterministically, in 
0{rnk'^) time) a sampling matrix Q G M"^*" and a positive diagonal rescaling matrix 
S G W'' such that, for all i = 1, k: 

1 - < cXi{V^QS) < 1 + 

Lemma 17 is due to Batson, Srivastava, and Spielman [10]. The deterministic 
algorithm promised in the lemma can be found in the (constructive) proof of The- 
orem 3.1 in [10]. In the next subsection, we present an important generalization 
of Lemma 17 to sample two ortho normal matrices simultaneously. Lemma 17 fol- 
lows from Lemma 19 with U = V, so the algorithm that we present in the proof 
of Lemma 19 in the Appendix is the algorithm of Lemma 17. Finally, we restate 
Lemma 17 in a different form, which is the form that we will use it in Section 5.1. 

Corollary 18. Frame the hypothesis in Lemma 17. For those Q, S and Vy G M.'^ : 

\\yy\\l<\\s^^^yy\\l<(^i + \lfj \\yy\\l 

We write [Q, S] = Barrier Sampling I {V, r) to denote such a procedure. 
Remark. An improvement of this result (on the run time) appeared recently by 
Zouzias in [143]. More specifically, in the parlance of Lemma 17, Theorem 12 
in [143] shows that there is a 0{nrk\o^{k) + k'^\og{k)r) deterministic algorithm 
with (l--/^ l|Vy||i < llS^l^^Vyll^ < (l + y^) l|Vy||^. The notation O(-) 
hides log(log(l/e)) and log(log(A;)) factors. The algorithm of Theorem 12 in [143] 
combines the algorithm of Theorem 11 in [143], which we mentioned in the re- 
mark of Section 3.1.4, along with the algorithm of Lemma 17. More specifically: if 
r > 30/;; ln(2A;), run the algorithm of Theorem 11 in [143]; if r < 30A;ln(2/c), select 
f = 30/cln(2A;) columns with the algorithm of Theorem 11 in [143] and then, by 
using the algorithm of Lemma 17, down-sample this to exactly r columns. This 
improvement implies a run time improvement in Theorem 42 of this thesis. 
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3.1.7 Deterministic Sampling with the Barrier Method II 

Lemma 19 below guarantees the existence of a deterministic algorithm for 
sampling columns from two matrices simultaneously. The details of the correspond- 
ing algorithm can be found in the (constructive) proof of that lemma that we give 
in the Appendix. Our algorithm generalizes Lemma 17; in fact, setting U = V in 
our Lemma gives Lemma 17, which is Theorem 3.1 in [10]. The innovation here 
is that we develop an algorithm that samples columns from two different matrices 
simultaneously. 

Lemma 19 (Dual Set Spectral Sparsification). Let V G R"-^^ and U G M"^^ with 
V"'"V = Ifc and U'^U = 1^. Let r is an integer with k < r < n and assume k,i < n. 
One can construct (deterministically, in 0{rn{k'^ + time) a sampling matrix 
Vt G M"^*" and a positive diagonal rescaling matrix S G R''^^ such that: 

afe(V^(]S) > 1 - and ||U^1^S||2 < 1 + 

We write [fi, S] = BarrierSamplingII{V, U,r) to denote such a procedure. 

3.1.8 Deterministic Sampling with the Barrier Method III 

The algorithm of Lemma 20 is similar to that of Lemma 19. The innovation 
here is to control the Frobenius norm of the sub-sampled matrix. A (constructive) 
proof of this result is given in the Appendix. 

Lemma 20 (Dual Set Spectral- Frobenius Sparsification). Let V G M"^'^ with V^V = 
Ifc and A G M^^". Let r is an integer with k < r < n and assume k < n. One can 
construct (deterministically, in 0{rnk'^ + in) time) a sampling matrix Q G M"^*" and 
a positive diagonal rescaling matrix S G R''^'' such that: 

crfc(V^l]S) > 1 - and ||Af]S||F < ||A||^. 

We write [Q, S] = BarrierSamplingIII{V, A, r) to denote such a procedure. 

Remark. Although in Lemmas 19 and 20 we assumed that the matrices are or- 
thonormal, this is not necessary (see Lemmas 71 and 72 in the Appendix). 
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3.1.9 Random Projections 

A classical result of Johnson and Lindenstrauss [85] states that, for any < e < 
1, any set of m points in n dimensions (rows in A G M'"^") can be linearly projected 
into r = 0(log(m)/e^) dimensions while preserving all the pairwise distances of 
the points within a factor of 1 ± e. More precisely, [85] showed the existence of a 
(random orthonormal) matrix R G such that, for all i,j = 1, ...,m, and with 

high probability (over the randomness of R, A(j) denotes the z-th row of A): 

(l-e)||A(,)-A(,)|| < ||A(,)i?-A(,)i?|| < (l + e)||A(,)-A(,.)||. 

Subsequent research simplified the proof of [85] by showing that such an embedding 
can be generated using an n x r random Gaussian matrix R, i.e. a matrix whose 
entries are i.i.d. Gaussian random variables with zero mean and variance [84]. 
Recently, [4] presented the so-called Fast Johnson-Lindenstrauss Transform which 
describes an R such that AR can be computed fast. In this thesis, we will use a 
construction by Achlioptas: [1] proved that a rescaled random sign matrix, i.e. a 
matrix whose entries have i.i.d values ±l/-\/r with probability 1/2, satisfy the above 
equation. We use such random projection embeddings in Section 6.2. Here, we 
summarize some properties of such matrices that might be of independent interest. 
It is particularly interesting to apply such embeddings in short-fat matrices with 
orthonormal columns. Sarlos is credited for this idea in [122]. 

Lemma 21 (See Appendix for the Proof). Let A G M'"^" has rank p, k < p, and 
< e < i. Let R e W''' is a rescaled random sign matrix constructed as we 
described above with r = Coke'"^ , where Cq is a (sufficiently large) constant. 

1. For all i = 1, k and w.p. 0.99: 1 - e < (Tiiy^R) < 1 + e. 

2. For any X G M"'^", Y G M"^^ and r: E [||XY - XRR^Yfp] < ^\\X\\l\\Y\\j,. 

3. For any X G M"'''" and r: E [||Xi?|||] = ||X|||, and Var[\\XR\\F] < 2||X|||/r. 
I W.p. 0.99: \\{VjRr-{VlRf\\ < 3e. 

5. For any X G M'"^" and w.p. 0.99: \\XR\\f < ^(1 + e)||X||F. 

6. W.p. 0.97: Ak = ARiVjR)+Vl + E; E G M™^" with \\E\\f < 4e||A - Afe||^. 
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3.1.10 Subsampled Randomized Hadamard Transform 

We give tlie definitions of tlie "Normalized Walsh-Hadamard" and tlie "Sub- 
sampled Randomized Hadamard Transform" matrices as well as a few basic facts 
for computations with such matrices. We use these matrices in Section 5.3. 

Definition 22 (Normalized Walsh-Hadamard Matrix). Fix an integer m = 2^ , 
for p = 1,2,3,.... The (non-normalized) m x m matrix of the Hadamard- Walsh 
transform is defined recursively as follows: 

-n-m 

Hm/2 ~"Hm/2 

The m X m normalized matrix of the Hadamard- Walsh transform is equal to 

H = m^^Hm. 

Definition 23 (Subsampled Randomized Hadamard Transform (SRHT) matrix). 
Fix integers r and m = 2^ with r < m and p = 1,2,3, .... A SRHT is an r x m 
matrix of the form 

e = S^n^DH, 

where 

• Kg M™-^™ is a normalized Walsh-Hadamard matrix. 

• D G M™^"^ is a diagonal matrix constructed as follows: each diagonal element 
is a random variable taking values {+1, —1} with equal probability. 

• Q E MJ^^^ is a sampling matrix constructed as follows: for j = 1,2, ...,r i.i.d 
random trials pick a vector Bj from the standard basis of M™ with probability 
^ and set the j-th column of Q equal to that vector. 

• S G R*"^^ is a rescaling (diagonal) matrix containing the value 

Proposition 24 (Fast Matrix- Vector Multiplication, Theorem 2.1 in [5]). Given x G 
M™ and integer r < n, one can compute the product 6x with at most 2mlog(r + 1)) 
operations. 



with 



+ 1 
+1 
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Lemma 25 ([4], Lemma 3 in [48]). Let U G M™^'^ has orthonormal columns. Let 
(DHU)(-^j denotes the i-th row of the matrix DHU G M'"^'^ and Si denotes the 
probabilistic event that \\ (DHU)^^) II2 < 2fc log^omfc) ^^^^^ randomness ofD): 

Pr[^i U £"2 . . . U < 0.95. (3.1) 



It is interesting to consider applying the SRHT to orthonormal matrices. 
Drineas et al are credited for this idea in [48]. Let U G M™^^ has orthonormal 
columns and m ^ k. The following lemma studies the singular values of the matrix 
OU. This result is very similar with the result of Lemma 13 with the only difference 
being the fact that before applying the SubspaceSampling method on the rows of 
U we pre-multiply it with a randomized Hadamard Transform, i.e. we apply the 
SubspaceSampling on the rows of the matrix DHU so, the matrices Q and S in 
Definition 23 can be obtained by a special application of SubspaceSampling: 

[Q, S] = SubspaceSamplingiDHU , 



21og(40A;m)' 

Since /3 < 1, we need to specify the sampling probabilities p^'s in Definition 12: 
1 1 II (DHU)(,) ||2 II (DHU)(,) Hi 

Pi ~ — ^ ~ H • 

m ~ 21og(40A;m) k k 

The inequality in this derivation is from Lemma 25. We formalize this discussion in 
Lemma 26, which can be viewed as the analog of Lemma 13. We should note that 
a mild improvement of Lemma 26 can be found in [137]. 

Lemma 26. Let U G R™^'' with m > k and U^U = K. Let B = ^, ,]^, and 

' 2 log(40Km.) ' 

4:k\n{2k/6)/(3 < r < m. Let [fl,S] = SubspaceSamplingiDHU , 2iog(40fcm) ' ^) ^'^^ 
Pi = 1/m in Definition 12. Then, for all i = 1, k, w.p. at least 0.95 — 6: 



8k\n{2k/5) hgjAOkm) ^ ^2(uTj3ThT^s) = ^.^(eU) < 1-' ■ I ^k\ni2k/6)\ogiA0km) 



Remark. From Lemma 1, we restate the latest result as: For any vector y G 



'8A;ln(2A;/5)log(40A;m)^„^^^_„2 ^ uc.T^r,.u2 / /1 , . /8'^ln(2A;/(5) log(40A;m)^„^^^_ 2 



^ ' ' ^)l|Vy||^ < lie^Vyll^ < (1+W ' : ^)||Vy 



12- 
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Spectral norm = 2) 


Frobenius norm = F) 


r = k 


« = x/f [34] 


a = Vk + l [35] 


r > k 


a = ^ (Section A.12) 


« = \/l + ^ [39] 



Table 3.1: Lower bounds ||A - n|,^^(A)|U|| A - A^H^ > a. 



3.2 Prior Work: Low-rank Column-based Approximation 

We start with a discussion on lower bounds for low-rank column based matrix 
approximation. The above table provides a summary on lower bounds for the ratio 

l|A-4..(A)||g ^ . 
IIA-Afcll^ - 

If this ratio is greater or equal to a specific value a, this means that there exists 
at least a matrix C such that no algorithm can find an approximation with a fac- 
tor a which is strictly better than a. An algorithm is called near-optimal if it - 
asymptotically - matches the best possible approximation bound. Notice that the 
approximation ratio is presented with respect to the best rank k low-rank approxi- 
mation computed with the SVD; this case is certainly interesting, but replacing A^ 
with Copt A would have implied more meaningful information for the hardness 
of approximation of this column selection problem. Here, Copt satisfies: 

Copt = argmin||A -n^^^(A)[|5. 

Lower bounds of the above form were available for both spectral and Frobenius 
norm and r = k. When r > k, a. Frobenius norm bound was given in [35] but a 
spectral norm bound didn't appear in prior work. Our Theorem 75 in the Appendix 
contributes a new lower bound for the spectral norm case when r > k. It is worth 
noting that any lower bound for the ratio ||A — CC'^A||^/||A — A^H^ also implies 
a lower bound for ||A — 11^ ^(A)||^/|| A — Afc||g; the converse, however, is not true. 
See Table 3.1 for a summary of the lower bounds. 



35 





Spectral norm = 2) 


Frobenius norm = F) 


r = k 


a = ^Ak{n - k) + I [66] 


a = ^k + l [34] 


k < r = o{k \og{k)) 






r = n{k\og{k)) 




+ [49, 34, 39] 



Table 3.2: Best Available algorithms (prior to our results) for the equa- 
tion: II A - n^_^(A)||g < all A - Afcll^. We offer a = {ypj^^ for ^ = 2, r > A; 
(Theorem 32); and a = ^/l + {k/r) for ^ = F, r > lOA; (Theorem 36). 



The Frobenius norm case (^ = F) 

We summarize prior work by presenting algorithms with accuracy guarantees 
a with respect to the equation 

||A - CC+A||p < ||A - n^^;-,(A)||p < a||A - Afc||i.. 

The r = k case. Theorem 8 in [34] describes a deterministic 0{knm^\og{m)) 
algorithm with approximation: 

||A - CC+A||^ = ||A - n^,fe(A)||i. < VkTl\\A - Ak\\F. 

This matches the lower bound presented in [35]. The algorithm of [34] is based 
on the method of volume sampling that we discussed in Section 3.1.2. [34] also 
presented a faster randomized algorithm achieving a (1 + e)\/k + 1 approximation 
(in expectation), running in 0(nm log(n)/c^e~^ + nlog'^(n) ■ /c^e~^ log (/ce"^ log(n))) 
time (See Theorem 9 in [34]): 

||A - CC+A||^ = ||A - n^,fc(A)||^ < (1 + e) y^TT||A - A^H^. 

Here e is given as input and can be made arbitrary small 0<e<l/2. It would have 
been interesting though to consider setting e of the order k such that to improve 
the run time of the algorithm but, to our best understanding, this is not possible, 
since Theorem 9 in [34] is based on a random projection type result from [101] that 
is based on the construction of Achlioptas in [1], which breaks unless e < 1. 
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The k < r = o{klog{k)) case. For this range of values of r, we are not familiar 
with any available algorithm in the literature. 

The r = Q{klogk) case. When r is asymptotically larger than k\og{k), there are 
a few algorithms available that we summarize below. Recall that for general r > k, 



the lower bound for Frobenius norm approximation is d = -i / l + j-. All algorithms 



optimal up to a factor 0(log(A;)). Our Theorem 36 presents an algorithm which 
asymptotically matches this lower bound; in fact it is optimal up to a constant 20. 

[49] presented a randomized algorithm that samples columns from A with 
probabilities proportional to the Euclidean norms of the rows of V^. In fact, the 
algorithm of [49] applies the SubspaceSampling method that we described in Section 
3.1.4 on the matrix of the top k right singular vectors A. For fixed A, fc, and 
r = Q{klog{k), [49] describes a 0{mnmin{'m,n} + rlog(r)) time algorithm that, 
with constant probability, guarantees 



[122] showed how to get the same result as [49] but with improved running time 



a rank k matrix that approximates A^ and can be computed in o{'mnmin{m,n}) 
This idea of Sarlos [122] is similar with our Lemma 5 in Section 2.2. 

In [39], the authors leveraged adaptive sampling (see Section 3.1.2) and volume 
sampling (see Section 3.1.3) to design a 0{mnk'^\ogk + r log(r)) time randomized 
algorithm that, with constant probability, obtains approximation error: 



Notice that this method, for arbitrarily e > 0, needs r = 0{k^\ogk + ke~^) 
columns to obtain ||A — n^^(A)||i;' < (1 + e) ||A — Ajfc||i?, whereas [49] needs r = 
0(A;log(A;)/e^). Finally, it is possible to combine the fast volume sampling ap- 
proach in [34] (setting, for example, e = 1/2) with 0(logA;) rounds of adaptive 



of this paragraph offer approximations of the order 





TiYk) + 0{nk + r log(r)), where G M"^^ contains the right singular vectors of 




37 



sampling as described in [39] to achieve a relative error approximation using r = 
O {k log k + ke~^) columns in time O {mnk'^ log(r;,) + nk^ log'^(?7,) log {k log(r;,))) . Also, 
a (1 + e)-error with r = 0(A;logA; + ke~^) columns is possible by combining [49] 
with 0(logA;) rounds of adaptive sampling as described in [39]. The run time of 
this combined method is 0(mnmin{m, n} + rlog(r)). Theorem 36 of our work 
gives (1 + e)-error with 0{k/e) columns in 0{mnk + nk^ + nlog(r)). Notice that 
our result considerably improves both running time and approximation accuracy of 
existing algorithms selecting r > k columns. 

The spectral norm case {C, = 2) 

We summarize prior work by presenting algorithms with accuracy guarantees 
a with respect to the equation 

||A - CC+AII2 < ||A - n2_,(A)||2 < a||A - Afclls. 

The r = k case. The strongest bound for r = k emerges from the Strong Rank 
Revealing QR (RRQR) algorithm of [66] (specifically Algorithm 4 in [66]), which, 
for / > 1, runs in 0(mn/c log^n) time, and constructs C G M"*^'^: 

||A - CC+AII2 = ||A - ^,(A)||2 < VPkin-k) + l\\A - A,h- 

One can also trivially get an aA//7^^-approximation in the spectral norm from any 
algorithm that guarantees an a-approximation in the Frobenius norm, because, for 
any matrix X, ||X||2 < ||X||p < ^rank(X) ||X||2; so, 

||A - CC+AII2 < ||A - CC+A||^ < a||A - A^H^ < a^p-k\\A - Akh- 

||A-n2^,(A)||2 < ||A-ng,,(A)||2 < ||A-ng,,(A)||^ < aWA-A.y < a^/i^\\A-Akh. 

The r > k case. We are not aware of any bound applicable to this domain other 
than those obtained by extending the available Frobenius norm bounds as indicated 
above. [120] presents an additive-error spectral norm algorithm but only for a special 

II Alp 

case of matrices where the ratio rj = , . is "small". More specifically. Theorem 
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1.1 in [120], for fixed A, < 5 < 1, < e < 1, constructs a matrix C G R"^'''' 
with r = r2(^log(^)) as C = AdditiveSampling{A, r) (see Section 3.1.1 for the 
description of AdditiveSampling) such that w.p. 1 ^jy: 



(The term 11^ ^(A) follows after reorganization of the notation in [120]. The left 
inequality follows by the optimality of 11^ ^(A)). 

Column-based approximations with the Rank-Revealing QR 

Finally, we comment on a line of research within the Numerical Linear Algebra 
community that, although does not primarily focus on low-rank column based matrix 
approximations, does provide several algorithms for such a task. This line of research 
focuses on spectral norm bounds for the r = k case. We start with the definition of 
the so-called Rank Revealing QR (RRQR) factorization. 

Definition 27. (The RRQR factorization) Given a matrix A G R^^^ (m > n) 
and an integer k (k < n), assume partial QR factorizations of the form: 



where Q G i?"*^" is an orthonormal matrix, R G is upper triangular, Rn G 

^fcxfc^ i?i2 £ Rf'xi-n-k)^ G R{ri--k)y<{n-k) ^ ^ ^raxn ^ permutation matrix. 

The above factorization is called a RRQR factorization if it satisfies 



where pi{k,n) and p2{k,n) are functions bounded by low degree polynomials in k,n. 

The work of Golub on pivoted QR factorizations [63] was followed by much 
research addressing the problem of constructing an efficient RRQR factorization. 
Most researchers improved RRQR factorizations by focusing on improving the func- 



A - Ill,iA)h < ||A - Ul,{A)h < ||A - A,h + e||A|h. 




Pi{k,n) 

0-fc+l(A) < CT, 



max 



(Rii) <(rk{A) 

{R22) <P2ik,n)ak+i{A) 



39 



Method 


Reference 




Time 


Pivoted QR 


[Golub, '65] 


[63] 




0{mnk) 


High RRQR 


[Foster, '86] 


[56] 


y/n{n - A;)2"-'^ 


0{mn'^) 


High RRQR 


[Chan, '87] 


[26] 


^n{n - A;)2"-^ 


0{mn'^) 


RRQR 


[Hong/Pan, '92] 


[80] 


^yk{n -k) + k 


- 


Low RRQR 


[Chan/Hansen, '94] 


[27] 


v/(A; + l)n2'=+i 


0{mn'^) 


Hybrid-I 


[Chandr./Ipsen, '94] 

L / J- ' J 


[28] 


^{k + l){n-k) 


- 


Hybrid-H 




28 

L J 


^J{k + l){n-k) 


- 


Hybrid-HI 




[28] 

L J 


^{k + l){n-k) 


- 


Algorithm 3 


[Gu/Eisenstat, '96] 


[66] 

L J 


^Jk{n -k) + l 


- 


Algorithm 4 




[66] 


ljpk{n -k) + l 


0{kmn log^(n)) 


DGEQPY 


[Bischof/ Orti, '98] 

L / " J 


[11] 


0{^{k + iy{n-k)) 


_ 


DGEQPX 




[11] 


0{^{k + l){n-k)) 


- 


SPQR 


[Stewart, '99] 


[132] 






Algorithm 1 


[Pan/Tang, '99] 


[116] 


0{^{k + l){n-k)) 




Algorithm 2 




[116] 


0{^{k + iy{n-k)) 




Algorithm 3 




[116] 


0{^{k + mn-k)) 




Algorithm 2 


[Pan, '00] 


[115] 


0{^k{n-k) + l) 





Table 3.3: Deterministic Rank-Revealing QR Algorithms. 



tions n) and P2(^, n) in Definition 27. Let H^ G M"^'^ denote the first k columns 
of the permutation matrix H. Then, if C = AH^: 

||A - CC+AII5 = ||A - h2^,(A)||2 = \\R22U 

for both ^ = 2, F. Thus, for ^ = 2, it follows that 

||A - CC+AII2 = ||A - H2^,(A)||2 < p2(fc,n)afc+i(A) = P2{k,n)\\A - A^l^, 

i.e., any algorithm that constructs a RRQR factorization with provable guarantees 
also provides the same provable guarantees for column-based matrix approximation. 
See Table 3.3 for a summary of existing results. A dash implies that the running 
time of the algorithm is not explicitly stated in the corresponding citation or that 
it depends exponentially on k. In addition, m > n and / > 1 for this table. We 
should note that our Theorem 37 gives an algorithm which is (up to a constant 4) 
as accurate as the best algorithm of this table but faster by a factor 0(1/ log{k)). 
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3.3 Prior Work: Coresets for Least-Squares Regression 

We quickly review the definition of tlie least-squares problem. The input is 
a matrix A G M™^" (m ^ n) of rank p = n and a vector b G M™; the output 
is a vector x^pt G M" that minimizes the distance ||Ax — b||2, over all x. If no 
additional constraints are placed on x, Xop* = A^b, which can be computed in 
0{mn'^) time [61]. When x is forced to satisfy additional constraints, for example, 
all elements of x are nonnegative, then, the regression problem does not have an 
analytical solution and algorithms can be as complex as the constraints. The results 
of this section along with our results from Chapter 5 are summarized in Table 3.4. 

Coresets for Constrained Regression. Dasgupta et. al. [32] presented a ran- 
domized algorithm^ that, with probability 0.5, constructs a (1 + e)-coreset of size 
r = ^^^{n ln(^) + ln(200)), in time 0(mnmin{m, n} + Tc). Tc is the time needed to 
compute an exact solution of a constrained regression problem with inputs C G W^"^ 
and be G W, which contain f = 82, 944n(nln(288) + ln(200)) (rescaled) rows from 
A and b, respectively. The constraints in this smaller problem are the same as those 
in the original problem. depends on the specific constraints on x. For example, if 
no constraints are placed on x, = 0{fn'^) via the SVD. The algorithm from [32] 
runs in two stages. In the first stage, 

f = 82, 944n(nln(288) + ln(200)) 

rows from A, b are selected to form C G M''^", be G W-^ then, in the second stage, 
exactly 

1296n, , ,36, , , 
r = — (n ln(— ) + ln(200)) 

e"' e 

rows are selected to construct the coreset C G R*"^" and be G M^. 

In both stages, the sampling of the rows is done with the same probabilistic 
technique. More specifically, for a fixed set of probabilities pi,...,pm, a diagonal 
matrix Q^- G R™-^*" is constructed as follows (j = 1,2 for the two stages). For 



■^This algorithm actually provides coresets for the more general problem of minimizing ||Ax — 
b||p. for p = I, 2,3, over all x (possibly constrained) but since our focus is on p = 2, we do not 
elaborate further on the p 7^ 2 case. 
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Year 


Ref. 


coreset size r = 


Time=0(x),x = 


5 


C/C 


2008 


[321 

J 


1296n(nln(^) +ln(200))/e^ 




0.5 


Y/Y 


2011 


Thm 42 


225(n + l)/e2 


mTi^ + mm? / 





Y/Y 


2011 


Thin 43 


36(n + 1) \n(2(n + 1) /6n) /e^ 


mn^ + r log(r) 




Y/Y 


2006 


[471 

L ' J 


3492^2 ln(3/5i)/e2 


mn^ _l_ ^ log(r) 




N/Y 

/ 


2008 




3200nVe2 


rriTi^ _l_ rp log(r) 


0.3 


N/Y 

/ 


2008 


[50] 


Ciu \og{n)/e'^ 




0.3 


N/Y 


2006 


[122] 


C2n log(n)/e 


mn log(m) + to log(n) / e 


0.7 


N/N 


2011 


[102] 


csn/e 


mnlog(mj + to/e 


0.7 


N/N 


2007 


[48] 


max{,^i , 40r2 \n{40mn) /e} 


mn log(r) 


0.2 


N/N 


2007 


[48] 


max{c4(1182d + 98^), 60r2/e} 


mn log(mrti) 


0.2 


N/N 


2008 


[119] 


({l±flitl)'lO(n + l)^ 


mn log(r) 


0.1 


N/N 


2011 


Thm 44 


72(n + 1) ln(2(n + l)/*2)f2/«^ 


mn log(r) 


52 


N/Y 



Table 3.4: Coresets/"Coresets" for (Constrained) Regression ||Ax — b||2; 
A G M™""". C/C abbreviates Coreset/Constraint and N/Y (NO/YES) 
implies "Coreset" for Constrained Regression, p = rank(A) = n. ci, C2,.--? 
are (unspecified) constants. A dash means that the time can not be 
specified exactly. 6 is the failure probability. ^2 = log(40(n + l)m). ^1 = 
482nln(40mn)ln(1002nln(40mn)); to = nMog^(m); ti = "''^H") (in(m) +n). 



i = 1, ...,m the i-th diagonal element of Q^^ is equal to l/^/Pi with probability Pi 
and with probability 1 — Pi (Bernoulli process). In the first stage {j = 1), the 
probabilities are 

. r, ll(UA)(.)||i ., 

Pi = mmjl, r|; 

n 

in the second stage (j = 2), the probabilities are refined as: 

Pi = min{l,max{pi, -w^r}}, 

where z = [2I, 2;^,] = (Axop^ — b)'^ G M}^"^ , with icopt = arg miux | [QiAx — Q]^b| [2. 
The final coreset C G R''^", b^ G W corresponds to the non-zero rows and elements 
from Q2A and Q2b, respectively. We slightly abused notation, since, in both stages, 
the actual rows that are sampled are not exactly f and r. It can be proved thought 
that - in expectation - f and r rows are sampled from A, b in the first and the second 
stage, respectively. 
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Coresets for Unconstrained Regression. When no constraints are placed on 
X, there are a few available algorithms in the literature that we present below. 

First, Theorem 3.1 in [47] describes a randomized algorithm that, with prob- 
ability at least 1 — 6, for any < d < 1, constructs a (1 + e)-coreset of size 
r = 3492n^ln(l/5)/e^ in time 0{mn'^ + rlog(r)). The algorithm of [47] is as 
follows: (i) for i = l,...,m, compute probabilities Pi that satisfy three conditions 



(let Ul = [ui, u^] G M"^'", u, G M", and (AA+b - b)^ = [zi, z^] G Ri^"*): 



(ii) sample r rows from A in r i.i.d. trials, where in each trial the i-th row is 
sampled with probability pi (an appropriate rescaling also applies to the rows of 
A and b). Notice that this algorithm requires the computation of the matrix Ua 
from the SVD and the optimum solution vector x^p^ = A~'^b. The factors 1/3 in 
the above conditions can actually be relaxed to any values < /3i,/32,/33 < 1, with 
a change in the sampling complexity by a multiplicative factor min{/3^, /3^}/9. 
An interesting open question is whether one can approximate these probabilities in 
o(mn min{m, n}) time. Magdon-Ismail in [100] made progress towards this direc- 
tion. Unfortunately, the algorithm of [100], which is based on random projections, 
returns probabilities that satisfy only the first condition, so, in the parlance of Theo- 
rem 3.1 in [47], [100] does not provide any useful result. As we will see below though, 
sampling with probabilities satisfying only the first condition gives a (1 + e)-coreset 
with comparable size but only with constant probability. Notice that the algorithm 
of this paragraph fails with arbitrarily small probability 6 and r = O (In (1/5)). 

Second, the first algorithm of Theorem 5 in [50] presents a randomized al- 
gorithm that, for any < /3 < 1, w.p. 0.7, constructs a (1 + e)-coreset of size 
r = 3200n^//3e^ by running [fl,S] = SuhspaceSampling{\] P,^) as we described 
in Section 3.1.4. This algorithm needs 0{mn'^ + r log(r)). Inspecting carefully the 
SuhspaceSampling method, we see that the probabilities that are used to sample 
columns from must satisfy pi > which is the first of the three conditions 

that we saw in the previous paragraph. One can compute these probabilities with 




1 



3 ||x, 



■opt 1 1 2 



2 ' 
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/3 = 1 through the SVD. [100] showed how to approximate these probabihties in 
o{mn'^) using random projections^. An important observation about the algorithm 
in Theorem 5 in [50] is that it doesn't need access to b in order to compute the 
coreset. This is useful, because the same coreset can be used for multiple b. 

Third, the second algorithm of Theorem 5 in [50] presents a randomized al- 
gorithm that, with constant probability, constructs a (1 + e)-coreset of size r = 
0(n log(n)//3e^) in time O(mn^). Here, the sampling probabilities are the same as 
in the previous paragraph; the sampling though is done with a slightly different 
way (without replacement). The algorithm of this paragraph is the best available in 
the literature with respect to the coreset size and running time. Again, the result 
from [100] implies a running time of o{mn^) to approximate the probabilities. 

"Coresets" for Unconstrained Regression. If we allow the coreset to be lin- 
ear combinations of the rows of A and b, there are several methods that construct 
(1 + e)- "coresets" in o(mn^) time. Sarlos in [Eqn. 2, Theorem 12, [122]] gives an 
algorithm to construct a (1 + e)- "coreset" of size r = 0{n\og{n) / e). [122] uses 
the Fast Random Projection method of [4]. Along the lines in [122], Avner and 
Zouzias in [Eqn. 3.3, Theorem 3.3, [102]] showed that r = 0{n/e) rows are enough; 
the construction in [102] uses the method of [4] as well. Drineas et. al in [48] 
employ the Subsampled Randomized Hadamard Transform that we described in 
Section 3.1.10 to design faster algorithms for constructing (1 + e)- "coresets" . More 
specifically, [Algorithm 1, Theorem 2, [48]] describes a method which does so with 
coreset size r = max{48^nln(40mn) ln(100^nln(40n(i)),40nln(40mn)/e}. The dif- 
ference of this algorithm with ours (Theorem 43) is that our "coreset" works for 
arbitrary constraint regression. Moreover, [Algorithm 2, Theorem 3, [48]] requires 
r = max{c(118^(i + 98^),60n/e}, for a constant c; this algorithm is based on a 
(sparse) random projection embedding. Along the same lines as [48] - by using a 
different matrix than the Hadamard - Rokhlin and Tygert in [Lemma 2, [119]] de- 
scribe an algorithm which constructs a "coreset" of size r = (|y^^|23})^10(?^ + 1)^ 
w.p. 0.9. Finally, [8] presents experiments with the randomized method of [48]. 

■^The algorithm that approximates the probabihties is randomized and, for any < (5 < 1, 
does so with probabihty 1 — 5 in time 0{mn\og{m) + mnlog(n) log(|) + i^^^ log(i))- So, for 
log(m) = o{n), the total run time is o{rmi?'). 
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Nonnegative Least Squares Regression Problems 

Notice that so far we haven't discussed algorithms and running times for solv- 
ing the regression problem per se. Here, we elaborate on a popular constrained 
regression problem called NNLS. The purpose of this section is to give further mo- 
tivation that coresets for constrained regression problems are indeed important in 
several applications. The Nonnegative Least Squares (NNLS) problem is a con- 
strained least-squares regression problem where the variables are allowed to take 
only nonnegative values. NNLS is a quadratic optimization problem with linear 
inequality constraints. As such, it is a convex optimization problem and thus it 
is solvable (up to arbitrary accuracy) in polynomial time [12]. The motivation for 
NNLS problems in data mining and machine learning stems from the fact that given 
least-squares regression problems on nonnegative data such as images, text, etc., it 
is natural to seek nonnegative solution vectors. (Examples of data applications are 
described in [29].) NNLS is also useful in the computation of the Nonnegative Matrix 
Factorization [89], which has received considerable attention in the past few years. 
Finally, NNLS is the core optimization problem and the computational bottleneck in 
designing a class of Support Vector Machines [125]. Since modern datasets are often 
massive, there is continuous need for faster, more efficient algorithms for NNLS. 

Existing Algorithms. We briefly review NNLS algorithms following the exten- 
sive review in [29]. Such algorithms can be divided into three general categories: (i) 
active set methods, (ii) iterative methods, and {Hi) other methods. The approach of 
Lawson and Hanson in [96] seems to be the flrst technique to solve NNLS problems. 
It is a typical example of an active set method and is implemented as the function 
Isqnonneg in Matlab. Immediate followups to this work include the technique of Bro 
and Jong [24] which is suitable for problems with multiple right hand sides, as well 
as the combinatorial NNLS approach of Dax [33]. The Projective Quasi- Newton 
NNLS algorithm of [88] is an example from the second category. It is an iterative 
approach based on the Newton iteration and the efficient approximation of the Hes- 
sian matrix. Numerical experiments in [88] indicate that it is a very fast alternative 
to the aforementioned active set methods. Finally, the sequential coordinate-wise 
approach of [58] and interior point methods [121] are also useful. 
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A Faster NNLS Algorithm. We show how the "coreset" construction algorithm 
of Theorem 43 can be used to speed up existing NNLS algorithms. Recall that the 
the construction of the "coreset" takes T^or = 0(mn log(n log(m)/e^)). After this 
step, we can employ a standard NNLS solver on the smaller problem. The compu- 
tational cost of the NNLS solver on the small problem is denoted as T^NLsif^^n), 
with r = 0(n ln(n) log(nm)/e^). Compare this with T/vArL5'(m, n) which is the time 
needed to solve the problem exactly. T^NLsij^fT) cannot be specified since theoret- 
ical running times for exact NNLS solvers are unknown. In the sequel we comment 
on the computational costs of some well defined segments of some NNLS solvers. 
NNLS is a convex quadratic program: 

min x^Qx — 2q^x, 

a;eIR",x>0 

where Q = A^A G M"^*^ and q = A^b G M". Computing Q and q takes O(mn^) 
time, and then, the time required to solve the above formulation of the NNLS prob- 
lem is independent of m. Using this formulation, our algorithm would necessitate 
Tcor time for the computation of the "coreset", and then Q = C"'"C and q = C"'"bc 
can be computed in Tmm = 0{rn'^) time (MM stands for Matrix Multiplication); 
given our choice of r, this implies Tmm = 0(n^ log(nm)/e^). Overall, the standard 
approach would take 0{mn'^) time to compute Q, whereas our method would need 
only Tcor + Tmm time for the construction of Q. Note, for example, that when 
m = 0{n'^) and treating e as a constant, Q can be computed 0{n/\og{nm)) faster. 

On the other hand, many standard implementations of NNLS solvers (and 
in particular those that are based on active set methods) work directly on A and 
b. A typical cost of these implementations is of the order O(mn^) per iteration. 
Other approaches, for example the NNLS method of [88], proceed by computing 
matrix-vector products of the form Au, for an appropriate n-dimensional vector u, 
thus cost typically 0{mn) time per iteration. In these cases, our algorithm needs 
again T^or preprocessing time, but costs only 0{rn^) or 0{rn) time per iteration, 
respectively. Again, the computational savings per iteration are 0{n/ log{nm)). 
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Reference 


Description 


Dimensions 


Time = 0[x\x = 


Error 


Folklore 


RP 


0(log(m)/e2) 


mn [e~^ log(m) / log(d)] 


1 + e 


[40] 


Exact SVD 


k 


mn min{m, n} 


2 


Theorem 47 


RS 


0{k\og{k)le^) 


rnnke"^ + to 


3 + e 


Theorem 49 


RP 


Oikje') 


mn \e~'^k/ log(n)] 


2 + e 


Theorem 50 


Approx. SVD 


k 


mnke"^ 


2 + e 



Table 3.5: Provably Accurate dimensionality reduction methods for fc-means 
clustering. RP stands for Random Projections, similarly for RS and Random 
Sampling. The technique in the second row of the table is deterministic; 
the others fail with, say, a constant probability. In the Random Projection 
methods the construction is done with random sign matrices and the mailman 
algorithm (see Section 6.2). to = klog{k)e~'^ log{klog{k)e^^). 

3.4 Prior Work: Feature Selection for /c-means Clustering 

Dimensionality reduction encompasses the union of two different approaches: 
feature selection, which embeds the points into a low-dimensional space by selecting 
actual dimensions of the data, and feature extraction, which finds an embedding 
by constructing new artificial features that are, for example, linear combinations of 
the original features. Let A be an m x n matrix containing m n-dimensional points 
(A(i) denotes the i-th point of the set), and let k be the number of clusters. We say 
that an embedding / : A — t- R'' with /(A(j)) = C(i) for all i = l,...,m and some 
r < n, preserves the clustering structure of A within a factor 0, for some > 1, 
if finding an optimal clustering in C G M"^^^ and plugging it back to A G R"*^" is 
only a factor of worse than finding the optimal clustering directly in A. 

Prior efforts on designing provably accurate dimensionality reduction methods 
for /c-means include: (i) random projections, where one projects the input points into 
r = 0(log(m)/e^) dimensions such that, the clustering structure is preserved within 
a factor of = 1 + e; and (ii) the Singular Value Decomposition (SVD), where one 
constructs C = U^Sfc G R™^'^' such that the clustering structure is preserved within 
a factor of = 2. We summarize prior work and our results in Table 3.3. Finally, 
other techniques, for example the Laplacian scores [79] or the Fisher scores [55], 
are very popular in applications (see also [70]). However, they lack a theoretical 
analysis; so, a discussion of those techniques is beyond the scope of this thesis. 
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3.5 Similar Studies 

There are several more related problems that received considerable attention 
over the last decade or so. We discuss some of these problems below. The presen- 
tation is careful to highlight the differences and the similarities with the algorithms 
and the results presented in this thesis. Other sources that give the big picture of 
this area include the surveys in [73, 86]. 

Matrix Multiplication 

Approximate Matrix Multiplication studies quick approximations of the prod- 
uct AB, given A, B. Let A e M™''" and B G W^'p. Then, a standard BIAS imple- 
mentation takes 0{mnp) time to compute the product AB exactly. Algorithms that 
approximate the product AB in o{mnp) time are of considerable interest. An idea 
that received considerable attention is to do that by first sampling a small subset 
of the columns of A and the corresponding rows from B, and then approximate the 
product AB with AB; Let r > is the number of sampled columns and rows from 
A and B, respectively. Then A G M"^^*" and B G W^"- are the subsampled matrices. 
Notice that the product AB takes 0{mrp) time; so, one needs to choose r as well as 
the technique to construct A, B such that the overall running time is o{mnp). There 
are several such approaches in existing literature; we describe some of them below. 
[41, 43] described a randomized sampling based algorithm with approximation: 



E 



|AB- ABjll 



<-||A|||||B||2, 
r 



The method of [41, 43] proceeds as follows. First, compute probabilities: 

l|A«||2||B(,)||2 



Pi 



E"=il|A(^')||2||B 



(i)ll2 



Then, construct a sampling matrix Q G R"^'' and a rescaling matrix S G W^^' as 
follows. Initially, Q = 0„xr and S = Orxr- Then, for every column j = 1, ...,r of Q, 
S, independently, pick an index i from the set {1,2, ...,n} with probability Pi and 
set = 1 and Sjj = 1 / y/PiV. Finally, set A = Af2S and B = S"'"r2"'"B. It is worth 
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noting that constructing 1], S by using any set of probabilities pi, ...,Pn gives: 



In fact, [41, 43] select the appropriate probabilities to get good bounds for the 
residual error. In the above approach, computing the probabilities takes 0{mn+np), 
sampling with replacement necessitates 0(n + rlog(r)), and computing the product 



The approach that we outlined above works for matrices of arbitrary dimen- 
sions and this is actually quite impressive. To our best knowledge, the above ap- 
proach is the best sampling-based method for approximate matrix multiplication 
with respect to the Frobenius norm. Notice thought that this algorithm is random- 
ized. An interesting open question is whether there exists a deterministic algorithm 
with comparable approximation bounds. Although such a deterministic algorithm 
might be too costly for approximating the product AB in o(mnp), we believe that 
such a result will open new directions in designing deterministic algorithms for other 
problems involving subsampling, e.g. low-rank matrix approximation and /c-means 
clustering. 

For approximations with respect to the spectral norm we refer the reader to 
[120, 102, 100, 81]. We comment carefully on the best of these results, which is 
statement (ii) in Theorem 3.2 in [102]. The setting is similar with the example that 
we outlined above for the Frobenius norm case. The construction of A, B is done 
with the same algorithm as well. The results for the spectral norm though do not 
hold for arbitrary matrices. It is required that the matrices have low stable rank. 
We define the stable rank of a matrix A as 



sr(A) = 




Assume that sr(A),sr(B) < p. Let e > and r = fi(plog(p/e^)/e^). Then, 
statement (ii) in Theorem 3.2 in [102] argues that with probability at least 1 — pp^^^^-) : 

||AB - ABII2 < e||A]|2||B]|2. 




AB takes 0{mrp). Overall, this approaches runs in 0{mn + np + mpr + r log(r)). 
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Deterministic Symmetric Multiplication. Notice that the algorithm just men- 
tioned above is also randomized. Deterministic approaches for the same problem 
will be particularly important. The celebrated deterministic sparsification result 
of [10] gives such a bound for short-and-fat matrices with orthonormal rows. We 
stated this result in Section 3.1.6. This bound can be generalized to short-and-fat 
matrices with any set of rows (with an additional 0{nk'^) cost for the SVD of the 
matrix - see below), but for now let us focus on the simple case. Let V G M"^*^ with 
n > k and V"'"V = Ifc. Let r > k. Then, deterministically in 0{nk'^r): 



l9k, 



iV^fiSS^fi^V - V^V||2 < \/ — IIVII2IIVII2. 

r 



To see this, recall Lemma 17. Take squares on the right hand side of the equation 
in that lemma and observe that: 



9k 



a/(V^fiS)<l + ^/-. 




The result follows by using the first and the third statements of Lemma 1 with 
e = -sj^- To extend this to arbitrary A G W^^'^ with n > k, is suffices to compute 
the SVD of A and apply the above algorithm to the matrix containing the left 
singular vectors of A. More specifically, let the SVD of A is A = UaSaVJ with 



UaG 



pn.xfc 



ikxk 



and Va G M^'^'^. Now, consider the following derivations: 



A^1]SS^(]^A- A^AIU 



< 



< 



VaSaUI^SS^^^UaSaVI - VaSaUIUaSaVI||2 
Va (SAUlfiSS^fi^UASA - SaUIUaSa) Vlh 
SaU^I^SS'^I^'^UaSa — SaUJUaSa||2 

Sa {ulnss'^n^VA - uIua) Sa||2 

A||2||UI1]SS^1]^Ua - UIUa||2||A||2 



Clearly, it suffices to approximate the product UaUa. Overall, in 0{nk'^ + nk'^r): 



l9k, 



lA^nSS^l^^A - A^Alb < a/ — IIAII2IIAII2. 

r 
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Deterministic Asymmetric Multiplication. The above deterministic result 
applies to the so-called symmetric matrix multiplication problem, i.e. for multipli- 
cation involving the product V"'"V. A simple modification of this result suffices to 
extend this to the asymmetric case. Let Vi G M"!^'^ and V2 G M'"^^'^. Let also 
max{ni,n2} > 2k. For now let us assume that V^Vi = V2 = Ifc (the general 
case is discussed later). For simplicity, let us also assume that ni = n2 = n (other- 
wise, padding the matrix with the smaller dimension with zeros resolves this issue). 
Consider the matrix A = [Vi; V2] G W^^^^ and apply the symmetric multiplication 
result (for general matrices) to this A with r > 2k: 



{VjQSS^Q"^Vi - V^Vi) (V^fiSS^fiTV2 - V^Vs) 
From the interlacing property of singular values: 



< 



18k, 



AlUIIAI 



Vi'fiSS^fi^V2-V|V2||2 < 



{yjQSS^Q'^V, - V^Vi) (V^fiSSTfiTv^ - vjv,) 
(V^fiSS^l^^Vi - VjVi) (V^l]SS^fiTV2 - VJV2) 



Also, IIAII2 < V2. Overall, in 0(max{?7,i, 71.2}^;^ 



v^nss^n^V2 - v! V2||2 < 



72k, 



V1II2IIV 



2||2- 



To extend this to general matrices A G M"^^*^, B G IR"^^^, not necessarily 
matrices of ortho normal columns, one needs to proceed as follows. First, compute 
the SVD of A, B in 0{nik^) and 0{n2k'^), respectively. Let the SVD of A is 



kxk 



A = UaSaVI with Ua G M"^^^ Sa G 
B is B = UbSbVI with Ub G M"^^^ Sb G 



and Va G 



pkxk 



Let also the SVD of 



ofcxfc 



, and Vb G 



!>kxk 



Observe that: 



lA^l^SS^l^^B - A^Blh < ||A||2||UT(]SS^1]^Ub - UTUBlbllBlh 



Clearly, it suffices to approximate UJUb- Overall, in 0(A;^(?7,i + ?7,2-l-max{?7,i, n2}r)): 



A^f^SS^fi^B - A^BIU < 



72k. 



AII2IIBII2. 



51 



Low-rank Matrix Approximation 

Although in this thesis we focused on fast column-based low rank matrix ap- 
proximations, such fast low-rank approximations can be obtained by other paths 
than sampling columns from the input matrix A. Examples of this kind of ap- 
proaches include [122, 75, 73] (Frobenius norm) and [97, 106, 141, 73] (spectral 
norm). The main idea in all these papers, except [75], is to compute an SVD of 
the matrix AR or R^A, where R is a random projection matrix as we described in 
Section 3.1.9. [75] computes low-rank approximations based on ideas from compu- 
tational geometry. 

A fast low-rank approximation in the spectral norm with impressive approx- 
imation guarantees appeared in [141]. The algorithm of [141] was analyzed more 
carefully in [73], and slightly more carefully in the current thesis; so. Lemma 60 in 
the Appendix provides, to my best understanding of existing literature and results, 
the most tight analysis of the technique of [141]. 

Existing algorithms for fast low-rank approximations in the Frobenius norm 
are accurate and fast. For example, the method of Sarlos [122] computes a rank-fc 
matrix with (1 + e)-error and failure probability < 5 < 1 in O {mnke~^ log{^) + 
(m -|- n)ke~'^ \og{^)). The method in [75] does so in 0{mnk{e~^ + k) \og{k / {e6))) . 
Below, we contribute a new analysis of an algorithm that employs the Hadamard 
Transform that we discussed in Section 3.1.10. Our analysis delivers a (1 -|- e)-error 
with constant probability in time 

O (m ■ n ■ k ■ hiik) ■ \og{kn) ■ + {m + n) ■ {k'^ ■ \n^{k) ■ log^(fcn) ■ e~^)) • 

A preliminary analysis of this algorithm appeared in Theorem 11.2 of [73] and 
Theorem 1 of [111]. [73] gives an approximation error that is not tight; [111] claims 
a running time that, as far as I can understand, is not correct. More specifically, an 
extra 0{mnd) term (in the notation of [111], d plays the role of r in our notation) 
in the running time seems to be necessary. This is because the computation of the 
best rank-Zc matrix 11^ ^(A) takes at least 0{mnd). [Ill] quotes [36], where this 
supposed to be done in 0{md'^), but this connection is not clear to the author. 
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Relative-error rank k Approximation with the Hadamard Transform. 

Theorem 28. Fix A G M™-^" of rank p, target rank k < p, and an accuracy parame- 
ter Q < e < 1/2. Construct a rank k matrixS G R™^" as follows: 

1: Let r = 200 ■ A; ■ ln(40A;) ■ log(40A;n)/e. 

2: Using definition 23 construct a SRHT matrix G M'"^'^. 

3: Construct the matrix C = AO""". 

4: Using the algorithm of Section 2.2 construct 11^ ^(A). 

5: Return E = Il^ j^{A) G M™-^" of rank at most k. 

Then, with probability at least 0.7; 

||A-S||^ < (1 + e) ||A-Afc||^. 

The algorithm runs in O [mnk \n{k) \og{kn)e~^ + (m + n)(/c^ ln^(/c) log^(A;n)e~^)) . 

Proof. We first comment on running time. Step 3 takes 0(mnlog(r)) (Lemma 24). 
Step 4 takes 0{mnr + (m + n)r^) (from Lemma 4). Our choice of r gives the overal 
running time. We continue by manipulating the term ||A — We would like to 
apply Lemma 6 with W = 0^ and ^ = F. First, notice that Lemma 26 gives: 



^ _ / 8A;ln(2fc/^)log(40fcn) ^ ^2^Qy^^ < ^ ^ . /8fcln(2A;/5) log(40A;n) 



Now, our choice of r implies {6 = 0.05) that w.p. at least 0.9: 

^<a,^(ev,)<i + #. 

25 V25 



The assumption on e < 1/2 and the left hand side of this inequality imply that 
rank(Vj0"'") = k; so, we can apply Lemma 6 (recall that H = 11^ ^(A)): 

||A - Hill < ||A - A.iil + ||(A - A,)e^{vle^m. 



We will return to this generic equation later. First, we prove three results of inde- 
pendent interest. 
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First result of independent interest. Recall that by Lemma 26 and our choice 
of r, for all i = 1, ...,k and w.p. 0.9: 1 - ^ < af(Vje^) < 1 + Let X = 
V^e^ G M'^^^' with SVD: X = UxSxV^. Here, Ux G M*^^^ Ex G and 
Vx G M''^^ By taking the SVD of X+ and X^: 

||(vJeT)+ - (vJeT)T||2 = iiVxSx^uT - VxExU^b == He^^ - Sxlh, 

smce Vx and can be dropped without changing any unitarily invariant norm. 
Let Y = — Sx G M''^^ be diagonal; Assuming that, for all i = l,...,k, niY) 
denotes the i-th diagonal element of Y: rj(Y) = —^f^^- Since Y is a diagonal 
matrix: ^ 

||Y||2 = max|r,(Y)| = max ^^^T^ < ^ 



l<i<k l<i<k Cri(X) A y/I 



/25 

The inequality follows by using the bounds for o",f(X) from above. The failure 
probability is 0.1 because the bounds for erf (X) fail with this probability. Overall, 
we proved that 

Second result of independent interest. Consider the term: || (A— Afc)G^GVfc|||.. 
We would hke to upper bound this term. Recall that 9"^ = HDi7S G R"^''. Eqn. (4) 
of Lemma 4 of [43] gives a result for the above matrix-multiplication-type term and 
any set of probabilities pi,P2, ■■■,Pn (set E = (A — Afc)HD, Z = D^'^H'^V^) : 

E [II (A - Afc)HDD^H^Vfc - (A - Afc)HDfiSS^fi^D^H^Vfc|||] 



,\F- 



First, notice that EZ = Omxfe- Our choice of p^'s is: 



1 1 II (DHV,)(,) \\l 

Pi = -> 



n 2 log(40/cn) 
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By using this inequality and rearranging: 

^r,,/« « N^T^,. 1,91 2A;log(40A;n) , . . ^^^^,,9 2A;log(40A;n) . . 
E [||(A - Ak)e^eVk\\l] < ^||(A-Afc)HD||| = ^ -\\^-Mf 



since HD can be dropped without changing the Frobenius norm. Finally, apply 
Markov's inequality to the random variable x = ||(A — Ak)Q'^QVk\\'F to get that 
with probability 0.9 

||(A-A.)e-^9V.ir^< ^°"°f°'"'' ||A-A.||^. 

Third result of independent interest. We would like to compute an upper 
bound for the term || (A-Afc)G^|||. Replace 9^ = HDf^S G R"''''. Then, Lemma 14 
on the random variable x = ||(A — Afc)0"'"|||, implies that with probability 0.9: 

||(A-Afc)eT|||<10||(A-Afc)HD|||. 

Notice that HD can be dropped without changing the Frobenius norm; so, w.p. 0.9: 

||(A-Afc)e^|||<10||A-A,||^. 

Back to the generic equation. Equipped with the above bounds, we are ready 
to conclude the proof of the theorem. We continue by manipulating our generic 
equation as follows: ||A — < 

< l|A-A,||^ + ||(A-A,)e^(vJeT)+||^ 

< ||A - A,||^ + 2||(A - A,)e^ev,||| + 2||(A - A,)e^((vje^)+ - {vle^f)\\l 

< ||A - A.iiU 2||(A - A.)e^ev.||| + 2||(A - A,)e^|||||((vle^)+ - (vle^fm 

< \\A-Ak\\l + 2^^^iMl^||A - A.IIU 2 . 1011 A - A.|||^ 



25 



^ II* - *'ll^ - {wnim ^ T^J^) ^11* - *'ll'^ ^ <^ ^ ■ '>ll^ - *'ll'^ 

The failure probability follows by a union bound on all the probabilistic events 
involved in the proof of the theorem. ■ 
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Column/row based Matrix Approximation 

This thesis focused on low-rank approximation of matrices expressed as 

A ^ CC+A, 

with C containing columns of A. Factorizations of the form 

A ^ Ci(C+AC+)Cj 

are also of considerable interest. Here, Ci contains columns of A and C2 contains 
columns of A""". We refer the reader to [50] for applications of such column-row 
approximations. In terms of algorithms, both relative error [50] and additive er- 
ror [42, 45] bounds are available (in the Frobenius norm). The relative error algo- 
rithm of [50] employs the randomized technique that we described in Section 3.1.4. 
The additive error algorithm of [42, 45] employs the randomized technique that we 
described in Section 3.1.1. Notice that all [42, 45, 50] provide randomized algo- 
rithms with Frobenius norm bounds and the setting is that one is given A, k and 
an oversampling parameter r > k; the algorithm then returns Ci, C2 containing r 
columns and rows, respectively. 

Deterministic column-row decompositions with spectral norm bounds by se- 
lecting exactly r = k columns/rows are described in [83, 114, 109, 67] and [139, 
140, 65, 64]. The algorithms in [139, 140, 65, 64] construct Ci by running the 
method that we described in Section 3.1.5 on and C2 by running the same de- 
terministic method on U^. It is quite interesting that in these papers appeared a 
preliminary version of Lemma 6 that we described in Section 2.2. The techniques 
of [83, 114, 109, 67] discuss the so-called Rank- Revealing LU factorization. It can be 
proved (well, with a little effort) that a RRLU factorization implies a factorization of 
the form A ^ Ci(C]'"AC^)Cj with provable approximation bounds. (Recall that in 
Section 3.2 we saw that a Rank-Revealing QR factorization implies a factorization 
of the form A ^ CC^A with provable approximation bounds.) 
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Subspace Approximation 

The column-based low-rank matrix approximation problem (in the Frobenius 
norm) studied in this thesis is as follows. Fix A G M"^^", k, and r > k. The goal 
is to find a set of r columns from A that contain a /c-dimensional subspace which 
is as good as the fc- dimensional subspace of the Singular Value Decomposition. We 
focused on algorithms with approximations of the form: 



Let Uyfc G M"^^'^ be the best /c-subspace in the column space of C, i.e. 11^ ^(A) = 
UyfcU^A. Also, let A has column representation A = [ai, a2..., a„] and recall that 
Ayfc = UfcU^A with Ufe G M"*^'^. The above equation is equivalent to: 



to be the p-norm distance of the vector x G from the subspace U G R"^^ . 
The p-norm of a vector x = [xi, ...,Xm] is defined as ||x||p = {Y^^i ■ So, the 

equation corresponding to our problem can be revised as: 



In words, we seek a subset of r points (columns from A) that contain a /c-subspace 
Ufc that is as good as the optimal fc-subspace from the SVD. Replacing p = 2 with 
p = 1, 3, 4, ... corresponds to the more general subspace approximation problem that 
received considerable attention as well. We refer the reader to [127, 38, 37, 52] and 
references therein for background and motivation for this generalized problem. In 
the general case {p ^ 2), SVD does not provide an analytical expression for the best 



A-nS.(A)||^<a||A-A 



k\\F- 




Define the function 



rf(x,U,p)= (||x-UU^x||p)', 
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/i;-subspace but one still seeks algorithms with approximations of the form: 

' n \ ^ / " ' 

^c/(ai,H(p,fc),p) < a ^ci(ai,H(p,fc),p) 



.4=1 / \i=l 



Here H(pfc) G R^x'^ is the best /c-subspace with respect to the p norm: 



H(p,fc) = arg min V'd(ai,H,p); 



i=\ 

and H(p,/c) is the best fc-subspace within the column space of A, i.e H(p,fc) = CX G 
^mxfc^ with X G W""^ as: 

n 

X = arg min y (i(aj,H,p). 

XGR'-xfc_H=CX 

i=l 

It is worth mentioning that the results of this line of research for p = 2 are not 
better than the results we presented in this thesis. For example, [127] describes 
a randomized algorithm that finds a subset of r = 0(/c^e"^ log(/ce~^)) points with 
corresponding approximation a = 1 + e. Unhappily, this algorithm runs in time 
exponential in ke~^ . [38] presents several interesting results that can be viewed as 
extensions of the methods that we presented in Sections 3.1.1 - 3.1.3 for the general 
subspace approximation problem. For example. Theorem 5 from [38] is the analog 
of the additive-error algorithm that we discussed in Section 3.1.1. 

Theorem 29. Fix A, k, p, < e,(5 < 1. There is a randomized algorithm that 
samples r = O (A;(^)^| log(|)) columns from A such that w.p. 1 — there is a k- 
dimensional subspace H5 G M™^'"' within the column space of the sampled columns: 

.i=l J \i=l 

It would be interesting to understand whether the other techniques that we 
used in this thesis for subspace approximation in the p = 2 norm (Frobenius norm) 
can be extended to the more general case oi p ^ 2. 
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Element-wise Matrix Sparsification 

Unlike most of the problems that we saw so far, where one is asked to select 
a subset of columns (or rows) from the input matrix, Element-wise Matrix Spar- 
sification studies matrix approximations by sampling individual elements from the 
matrix. Let A G M™-^" be the input matrix, k be the target rank for the approx- 
imation, and r ^ k he the number of sampled elements from A; we would like to 
construct A G M"^^" with r non-zero entries such that ||A — Afc||^ is as close to 
||A — Afell^ as possible. A^ G M™^'" is the best rank k approximation of A computed 
with the SVD. Notice that A is a sparse matrix, so an SVD on this matrix will 
be faster than an SVD on A, which is dense. We should note thought that the 
later claim can not be proved theoretically; in practice though it is well known that 
algorithms for computing the SVD, such as the Lanczos iteration or the power iter- 
ation, operate much faster on sparse matrices. So, matrix sparsification does offer 
yet another way of fast rank k approximations to matrices. Matrix Sparsification 
pioneered by Achlioptas and McSherry in [2]. We quote Theorem 3 from [2], which 
gives an idea of the approximations that can be achieved using this approach. 

Theorem 30. Fix A G W^"- with 7Q < m < n. Let (5 = maxij|Aij|. For 
any p > 0, define Tij = and pij = maxjrjj, -sj Tij ■ 8^ log^(r2) / n} . Let A be a 

random mxn matrix whose entries are i.i.d. as Ajj = Aij/pij w.p. pij, and A.^ = 
w.p. 1—pij. Then, w.p. 1 — e""'^^^"^*^"-' .■ 

A-Afc 2 < A- Afc 2 + 2^-^; 

VP 



A - Ak F < A - Ak F + — + 2 J \ Ak\ p. 

Vp ^ Vp 

Moreover, if f is the random variable counting the number of non-zero elements in 
A, then 

E[f] <j9- ||A||^/3-2 + 40961og^(n)m. 

We refer the reader to [51] for an updated discussion on existing literature for 
this topic. To our best knowledge, all existing algorithms for matrix sparsification 
are randomized. Recently, [143] presented the first deterministic algorithm. 
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Graph Sparsification 

Given a dense graph Q, there are apphcations where it is required to approxi- 
mate the graph with another sparse graph H, i.e. a graph with a much smaUer num- 
ber of edges than the original graph. We refer the reader to [130, 131, 10, 143, 87] 
and references therein for a discussion of such apphcations. There are many notions 
of graph sparsification. What it really means that a graph approximates another 
graph? Motivated by applications on solving linear equations with Laplacian matri- 
ces, [130] introduced the so-called spectral notion of graph sparsification. Assume 
that Ag and A-^ are the Laplacian matrices of the graphs Q and "H, respectively. 
Then, the graph "H approximates the graph Q "spectrally" if the eigenvalues of A-^ 
are within relative error accuracy from the eigenvalues of Ag (0 < e < 1): 

(1 - e)Ai(Ag) < KiAn) < (1 + e)\i{Ag). 

[131] showed that in order to construct such spectral graph sparsifiers, it suffices 
to preserve, after column sub-sampling, the singular values of a special short-fat 
matrix with orthonormal rows. The number of columns of this matrix is the same 
with the number of edges in the graph, so edge selection corresponds to column 
selection to this special matrix. After this result in [131], it is immediate that 
one can use column sampling algorithms for graph sparsification. For example, 
[131] uses the randomized technique that we described in Section 3.1.4 to construct 
sparsifiers with r = 0(m log(m)/e^) edges; m is the number of vertices in the graph. 
[10] uses the method of Section 3.1.6 to construct sparsifiers with r = 0{m/e^) 
edges deterministically. [143] combines and improves upon the ideas of [131, 10] to 
construct sparsifiers with r = 0(m/e^) edges deterministically but faster than [10]. 
Finally, [87] discusses the construction of such spectral sparsifiers in the streaming 
model of computation. 

Linear Equation Solving 

In a series of papers [129, 90, 91, 92, 93], there were developed fast approxi- 
mation algorithms for solving systems of Linear Equations with Laplacian matrices. 
Let a graph has m vertices and n edges; this corresponds to a Laplacian matrix 
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A G M™^'" with 0{n) non-zero entries. Solving a system with this matrix takes 
O(m^). In a breakthrough paper [129], Spielman and Teng showed how to do that 
approximately in o(m^) time. Subsequent research improved upon the work of [129]. 
Currently, the best such method is [93] which solves this system approximately in 



The basic idea of all these methods is to sparsify the graph (i.e. sparsify the lapla- 
cian matrix) and then use a standard method, such as the Conjugate Gradient 
method, on the sparsified Laplacian. Clearly, the sparsity of the new Laplacian 
matrix improves the computational efficiency of standard methods. The real meat 
in this approach thought is that spectral sparsification of graphs implies relative 
error approximation to the solution vector of the linear system. We refer the reader 
to [129] for the corresponding details. 

Coresets for fc-means Clustering 

One of the three problems studied in this thesis is feature selection for A;-means 
clustering. The idea is to select a subset of the features and by using only this subset 
obtain a partition of the points that is as good as the partition that would have been 
obtained by using all the features. A complementary line of research [76, 77, 57, 53, 3] 
approaches the fc-means problem by sub-sampling the points of the dataset. The 
idea here is to select a small subset of the points and by using only this subset 
obtain a partition for all the points that is as good as the partition that would 
have been obtained by using all the points. [76, 77, 57, 53, 3] offer algorithms for 
(1 + e) approximate partitions. Note that we were able to give only constant factor 
approximations. For example, [57] shows the existence of an (1 + e)-approximate 
coreset of size r = 0(/i;'^/e""^^) (n is the number of features). [53] provides a coreset 
of size r = poly{k, e~^). The techniques used for all these coresets are different from 
the techniques we used for feature selection. It would be interesting to understand 
whether the techniques from [76, 77, 57, 53, 3] are useful for feature selection as 
well. In particular, it appears that there is potential to obtain relative error feature 
selection fc-means algorithms by using such approaches. 



O [n ■ log(m) ■ log(-) ■ log^(log(n)) 
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Trace Approximation 

Let A G M"^" is a square matrix. Tlie trace of A, denoted witli Tr(A) G M, 
equals the sum of its diagonal elements. So, computing the trace of an explicit matrix 
is a simple operation which takes 0{n) time. There are applications though (see, 
for example, [9]) where one needs to compute the trace of an implicit matrix, i.e. a 
matrix /(A) G M"^"; / is some function on A, for example, /(A) = A'^ + 2 ■ A + 1„. 
Computing the trace of /(A), given A, is a rather expensive task. In such cases, 
quick approximations to the exact value Tr(f(A)) are of interest. 

Avron and Toledo [9] wrote a very influential paper on approximation algo- 
rithms for estimating the trace of implicit matrices. Such algorithms were known 
before to perform well in several real applications. On the negative side, there was 
no theoretical analysis for the performance of these algorithms. On an effort to close 
this theory-practice gap, [9] provided a theoretical analysis of several existing trace 
approximation algorithms. Below, we present the basic idea of these algorithms 
and the type of approximations that can be achieved. For simplicity, following the 
discussion in [9], we assume that one is interested in estimating the trace of A. 
Replacing A with some function of A doesn't require any different analysis. 

The main idea of existing trace approximation algorithms is to return an ap- 
proximation Tr(A) that is computed as follows: 



Here, p > is an integer; clearly, the largest we choose p, the better the approx- 
imation is. Of course, choosing a large p affects the running time of the method. 
The vectors Zj G are random vectors chosen from a probability distribution. Dif- 
ferent probability distributions give different trace approximation algorithms. For 
example, Hutchinson's method [82] suggests that a specific Zj has entries where 
each one is chosen i.i.d as a Rademacher random variable (each entry equals ±1 
with the same probability). Hutchinson [82] proved that his estimator is unbiased: 
E [z'^Az] = Tr(A); z G M.^ is a random variable chosen as described above. [9] 
proved the following bound for the approximation Tr(A). Let A is a PSD symmet- 




1=1 
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ric matrix; then, for any < 5, e < 1, it suffices to choose p > ^ln(2n/5) random 
vectors Zj such that w.p. 1 — 5: 



fr(A)-Tr(A) <e-Tr(A). 



An immediate apphcation of fast trace estimators is on counting triangles in a graph. 
Tsourakakis [138] observed that in an undirected graph Q with adjacency matrix 
representation A, the number of triangles T3 in the graph equals 

T3 = ^Tr(A3). 

Avron in [7] used his estimators from [9] to quickly approximate the number of 
triangles in large real graphs. 

Fiedler Vector Approximation 

The discussion here assumes some familiarity with graphs, Laplacian matrices, 
and spectral clustering. We refer the reader to [128] for the necessary background. 
Fiedler [54] made an outstanding contribution to the topic of clustering data. Given 
m points xi,...,Xm in some Euclidean space of dimension n, [54] suggests that 
one can obtain a 2-clustering of the points by using the signs of the eigenvector 
corresponding to the second smallest eigenvalue of the Laplacian Matrix of the 
Graph that corresponds to these points (the idea of spectral clustering [126] basically 
extends this result to any k > 2 partition by working with multiple eigenvectors). To 
recognize the outstanding contribution of [54], this eigenvector is known as Fiedler 
vector. The graph mentioned above has m vertices, and for any two points Xj, x^ the 
corresponding edge denotes the distance between the points; for example, the weight 
in that edge is e"'''~^^"2. Computing this eigenvector takes 0(m'^) through the SVD 
of the Laplacian matrix A G M™^"*. Since data are getting larger and larger, faster 
(approximate) algorithms for the Fiedler vector are of particular interest. 

A result of Mihail [108] indicates that any vector that approximates the Rayleigh 
quotient of the second smallest eigenvalue of the Laplacian matrix can be used to 
partition the points. Influenced by the result of Mihail, Spielman and Teng in [129] 
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give the following definition for an approximate Fiedler vector. 

Definition 31 (Approximate Fiedler Vector). For a Laplacian matrix A and < 
e < 1, V G M"^ is an e- approximate Fiedler vector if\ is orthogonal to the all-ones 
vector and 

vAv , , f A f , , , , . , 

^<(l + 6)^ = (l+6)A„„i(A). 

Uere f G denotes the Fiedler vector. 

Theorem 6.2 in [129] gives a randomized algorithm to compute such an e- 
approximate Fiedler vector w.p. 1 — 5, in time 

O ( nnz(A)log^^(nnz(A))log(^)log(-)e-^ 

This method is based on the fast solvers for linear systems with Laplacian matrices 
developed in the same paper. Another method to compute an e-approximate Fiedler 
vector described recently by Trevisan in [134]. This method uses the power iteration 
and computes such an approximate vector with constant probability in time 

O (nnz(A) log(nnz(A)/e)e"^) . 

Data Mining Applications 

Most of the matrix sampling algorithms described in this thesis are motivated 
by applications involving the analysis of large datasets. Mahoney [103] gives a nice 
overview of how such algorithms are useful in the analysis of social network data 
and data arising fron human genetics applications. Column sampling algorithms 
for analyzing genetics data are also described in [117]. A few more interesting case 
studies can be found, for example, in [105, 133, 22, 124]. For a comprehensive 
treatment of several other data applications we refer the reader to [62, 104, 94], 
which describe the topics of three successful Meetings (Workshops on Algorithms 
for Modern Massive Data Sets) focusing exactly on applications of matrix sampling 
algorithms to data mining problems. 



CHAPTER 4 
LOW-RANK COLUMN-BASED MATRIX 

APPROXIMATION 

We present our results on low-rank column-based matrix approximation. The ob- 
jects of this problem are the m x n matrix A of rank p, the target rank k < p, and 
the number of sampled columns k < r < n. C G M™^*^' contains r columns from 
A and 11^ ^.(A) G M™^"- is the best rank k approximation to A (under the ,^-norm) 
within the column space of C. Recall that, for r > k: 

||A-CC+A||,<||A-4^,(A)||^; 

and when r = fc: ||A — CC^A||g = || A — 11^ ^(A)||g. The goal is to design algorithms 
that construct C with "small" a and guarantee: 

||A - CC+AII5 < ||A - n^c,fc(A)||5 < a||A - A^H^. 

The structure of the present chapter is as follows. In Sections 4.1 and 4.2, we 
assume r > k and present results for spectral norm and Frobenius norm, respectively. 
Section 4.3 presents our algorithms for the Column Subset Selection Problem (r = k) 
and a novel algorithm for an Interpolative Decomposition of a matrix. We give the 
proofs of all the results presented in this chapter in Section 4.3. All the proofs in 
this section are obtained by combining Lemmas 5, 6, 7, and 8 of Section 2.2 along 
with some results of Section 3.1. 

4.1 Spectral Norm Approximation {r > k, ^ = 2) 

By combining the deterministic exact SVD of Lemma 6 in Section 2.2 along 
with the deterministic technique of section 3.1.7, we get our first result on deter- 
ministic column-based matrix reconstruction with respect to the spectral norm. 

^Portions of this chapter previously appeared as: C. Boutsidis, P. Drineas, and M. Magdon- 
Ismail, Near-Optimal Column-Based Matrix Reconstruction, arXiv:1103.0995, 2011. 
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Input: A G M"*^" of rank p, target rank k < p, and oversampling 
parameter r > k. 

Output: CQS G M™^'' with r (rescaled) columns from A. 

1: Compute the matrices and Vp^k with the SVD. 
2: S] = BarrierSamplingII{Vk,y p-k,'^) (Lemma 19 in Section 
3.1.7). 

3: Return C = AilS G M™^^ 



Algorithm 1: Deterministic spectral norm reconstruction with r > k. 



Input: A G M™-^"- of rank p, target rank 2 < k < p, and oversampling 
parameter r > k. 

Output: CQS G M."^^^ with r (rescaled) columns from A. 

1: Z = FastSpectralSVD{A, k, 1) (Lemma 7 in section 2.2). 

2: [Q, S] = BarrierSamplingII{Z,In,r) (Lemma 19 in Section 3.1.7). 

3: Return C = AfiS G M™^'". 



Algorithm 2: Fast spectral norm reconstruction with r > k. 



Theorem 32 (Deterministic spectral norm reconstruction). Given A G M™^" of 
rank p, a target rank k < p, and an oversampling parameter r > k, Algorithm 1 
(deterministically in T(Vfc, Vp-k) + O (rn {k'^ + (p — ^)^)) ) constructs C G M'"^'".- 

I|A - n^c,.(A)|b <{l+ ^-^-[^^^) - A.lb = O [^) ||A - Akh. 

The asymptotic multiplicative error of the above theorem matches a lower bound 
that we prove in Section A. 12. This is the first spectral reconstruction algorithm with 
asymptotically optimal guarantees for arbitrary r > k. Previous work presented 
almost near-optimal algorithms only for r = k [66]. We note that in the proof of 
this theorem in Section 4.3, we will present an algorithm that achieves a slightly 
worse error bound (essentially replacing p by n in the accuracy guarantee), but only 
needs to compute the top k right singular vectors of A (i.e., the matrix V^). 
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Input: A G M"*^" of rank p, target rank k < p, and oversampling 
parameter r > k. 

Output: CQS G M™^'' with r (rescaled) columns from A. 
1: Compute the matrix with the SVD. 

2: [n, S] = BarrierSamplingllliyk, A. — AVfcV^,r) (Lemma 20 in 

Section 3.1.8). 
3: Return C = AfiS G M"^'^. 



Algorithm 3: Deterministic Frobenius norm reconstruction with r > k. 

Next, we describe an algorithm that gives (up to a tiny constant) the same 
bound as Theorem 32 but is considerably more efficient. In particular, there is no 
need to compute the right singular vectors of A; all we need are approximations 
that we can obtain through Lemma 7 of section 2.2. Randomization is the penalty 
to be paid for the improved computational efficiency. 

Theorem 33 (Fast spectral norm reconstruction). Given A G M™^" of rank p, a 
target rank 2 < k < p, and an oversampling parameter r > k, Algorithm 2 (randomly 
in O (mn/c log (/c~^ min{m, n}) + nrk'^)) constructs C G M™^^' such that: 

^[\\A-Yil,{A)h] < (v^+l) (l + YT^) l|A-A.|b = O (v^) IIA-A 



A;||2- 



4.2 Frobenius Norm Approximation {r > k,^ = F) 

By combining the deterministic exact SVD of Lemma 6 of section 2.2 along 
with the deterministic technique of section 3.1.8, we get a deterministic column- 
based matrix reconstruction algorithm in the Frobenius norm. 

Theorem 34 (Deterministic Frobenius norm reconstruction). Given A G M™-^"- of 
rank p, a target rank k < p, and an oversampling parameter r > k, Algorithm 3 
(deterministically in T(Vk) + O {mn + nrk^)) constructs C G R"^^^ such that: 



|A-n^.(A)||^< 



\ (i - 



2i|A- Afelli. < ^/2 + ( ^ )||A- Afeii^. 
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Input: A G M"*^" of rank p, target rank 2 < k < p, and oversampling 
parameter r > k. 

Output: CQS G M™^'' with r (rescaled) columns from A. 

1: Z = FastFrobeniusSVD{A, k, 0.1) (Lemma 8 in section 2.2). 
2: [Tl, S] = Barrier S amplingi 1 1 {7j , A — AZZ"*", r) (Lemma 20 in 

Section 3.1.8). 
3: Return C = AfiS G M'"^''. 



Algorithm 4: Fast Frobenius norm reconstruction with r > k. 

By combining the randomized approximate SVD of Lemma 8 of section 2.2 
along with the deterministic technique of section 3.1.8, we get a considerably faster 
randomized column-based matrix reconstruction algorithm in the Frobenius norm. 

Theorem 35 (Fast Frobenius norm reconstruction). Given A G M™^" of rank p, a 
target rank 2 < k < p, and an ov er sampling parameter r > k, Algorithm 4 (randomly 
in O {mnk + nrk^) ) constructs C G M™^'' such that: 



E[||A-n^,,(A)||^] < 



1-1 + 7 — — 72 l|A-Afe||F < 




Previous work presented deterministic near-optimal algorithms only for r = k 
[34]. We are not aware of any deterministic algorithm for the r > k case; previous 
work presents only randomized algorithms that fail unless r = Q{klog{k)) [47, 39]. 

Both Theorems 34 and 35 guarantee constant factor approximations to the 
error ||A — Afc||ir. In both cases, for arbitrary small e > 0, if r = 0{k/e): 

l|A-n^,,(A)i|^<(/3 + e) ||A-A,||^, 

(In Theorem 34, /3 = 2; in Theorem 35, (3 = 3). We stated the result here with the 
squares since it is stronger and is necessary to conclude that r = 0{k/e) columns 
give constant factor approximations. Manipulating the bound without the squares 
yields r = 0{k/e^) columns, which is weaker. The stronger bound with the squares 
is possible and can be found in the proofs of the corresponding theorems. 
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Input: A G M"*^" of rank p, target rank 2 < k < p, and oversampling 
parameter r > 10k. 

Output: CQS G M™^'' with r (rescaled) columns from A. 

1: Z = FastFrobeniusSVD{A, k, 0.1) (Lemma 8 in section 2.2). 
2: [Tl, S] = BarrierSamplingIII(Z, A — AZZ'^,Ak) (Lemma 20 in 
Section 3.1.8). 

3: C2 = AdaptiveSamplinglA, AQ,r — 4k) (Section 3.1.2). 
4: Return C = [C2; Afi] G M™''^ 

Algorithm 5: Fast relative-error Frobenius norm reconstruction. 

Constant factor approximations are interesting but not optimal. Here, we 
describe an algorithm that guarantees (1 + e)-error Frobenius norm approximation. 
We do so by combining the algorithm of Theorem 35 with one round of adaptive 
sampling [39], i.e. with the randomized technique described in section 3.1.2. This is 
the first relative-error approximation algorithm for Frobenius norm reconstruction 
that uses a linear number of columns in k (the target rank). Previous work [46, 122] 
achieves relative error with 0{klogk + ke~^) columns. Our result is asymptotically 
optimal, matching the fl{k/e) lower bound in [39]. 

Theorem 36 (Fast relative-error Frobenius norm reconstruction). Given A G M™-^" 
of rank p, a target rank 2 < k < p, and an oversampling parameter r > 10k, 
Algorithm 5 (randomly in O {rank + nk^ + nlog{r))) constructs C G M™^'' such 
that: 

E [||A - hS,,(A)||^] < Y^l+^^IIA - A.lli.. 

Intheproof of the theorem we get: E [|| A - Hg (A) |||,] < (1 + ^)||A-Afc|||,. For 
any given e > 0, choosing r > 4:k+6k/e > 10k /e columns gives E [|| A — Hq ^(A)|||.] < 
(1 + e) II A — Afc Wjp. Taking square roots on both sides of this equation and observing 
that yiTe < 1 + e: y^E [|| A - Hg^fc(A)|||] < (1 + e)||A - AfcH^.. Finally, Holder's 

inequality implies: E [||A - U^i^{A)\\f] < y^E [\\A -U^^^{A)\\l]. Overall, 
E[||A-HS^,(A)||^] <(1 + 6)||A-A,||^. 
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Note on Running Times. 

All our running times are stated in terms of the number of operations needed 
to compute the matrix C, and for simplicity we assume that A is dense; if A is 
sparse, additional savings might be possible. Our accuracy guarantees are stated 
in terms of the optimal matrix nQ^(A), which would require additional time to 
compute. For the Frobenius norm = F), the computation of 11^;, (A) is straight- 
forward, and only requires an additional O {mnr + (m + ra) r^) time (see the dis- 
cussion in Section 2.1). However, for the spectral norm = 2), we are not aware 
of any algorithm to compute IIqi^{A) exactly. In Section 2.1 we presented a sim- 
ple approach that computes Hqj^^A), a constant-factor approximation to Uqj^IA), 
in O {mnr + (m + n) r^) time. Our bounds in Theorems 32 and 33 could now be 
restated in terms of the error || A — 11^ ^,(A) H^; the only change in the accuracy 
guarantees would be a multiplicative increase of \/2 (from Lemma 4). 

4.3 The Column Subset Selection Problem {r = k, ^ = 2, F) 

Our focus has been on selecting r > k columns; however, selecting exactly k 
columns is also of considerable interest [19, 34, 107]. Selecting exactly k columns 
from A is known as the Column Subset Selection Problem (CSSP). We present novel 
algorithms for the CSSP below. 

The basic idea of our algorithms is to use the algorithms in Theorems 33 and 
35 to select r = dk columns, with, for example, some small constant d = 2,3,4..., 
and then down sample this to exactly k columns using the deterministic RRQR 
technique of section 3.1.5. We will use exactly this approach to obtain k columns 
for Frobenius reconstruction (Theorem 38). For spectral reconstruction, selecting 
k columns using the deterministic RRQR technique of section 3.1.5 suffices to give 
near-optimal results (Theorem 37). We also present a two-step algorithm that gives 
bounds for both the spectral and the Frobenius norm but uses the randomized 
technique of Section 3.1.4 in the first step (Theorem 39). For the exact description 
of the algorithms in the three theorems below see the corresponding (constructive) 
proofs in Section 4.3. 
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Theorem 37 (Randomized Spectral CSSP). Given A G M"^^" of rank p and a 
target rank 2 < k < p, there is a 0{mnklog{m.m{m,n)/k) + nk'^logn) randomized 
algorithm to construct C G IR™'^'^ such that 

E [||A - CC+AII2] < 4v/4A;(n-A;) + l)||A - A^h- 

To put our result into perspective, the best known algorithm for spectral-norm CSSP 
[66, Algorithm 4], is deterministic, runs in 0{kmnlog{n)), and gives: 

||A - CC+AII2 < v/(4A;(r2-A;) + l)||A - A^h- 

We are worse by a constant 4 but faster by 0(1/ log(l/A;)). 

Theorem 38 (Randomized Frobenius CSSP). Given A G M™^" of rank p and a 
target rank 2 < k < p, there is an 0{mnk + nk^ + k^\og{k))) randomized algorithm 
to construct C G M'^^'^ such that 

E [||A - CC+A||f] < 9A;||A - AfcH^. 

To put our result into perspective, the best known algorithm for Frobenius-norm 
CSSP [34, Theorem 9, e = 1/4], which runs in 0{mnk'^ log(m)+mk'^ log'^(m) log(/c logm 
gives: 

E [||A - CC+A||^] < v/l.25(A; + l)||A - A^j]^, 
We are worse by 0{k) but faster by at least 0{l/{k\og{m))). 

Theorem 39 (Randomized Spectral/Frobenius CSSP). Given A G M'"^" of rank 
p, 1 < k < p, and < 5 < 1, there is an algorithm to construct C G M™-^'"' in 
O {mnk + k'^ ■ hi{k/5) + k ■ \\i{k/5) ■ log {k ■ hi{k/5)))) such that w.p. 1 - 36: 



For Frobenius norm, treating 5 as a constant, the approximation error is 
0{k^y\og{k)); for spectral norm, we get a somewhat unusual bound, i.e. the right 
hand side contains the term ||A — Afe||i? < y/ p — k\\A — Afc||2. 
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Interpolative Decompositions 

We conclude this section by describing a novel algorithm for an Interpolative 
Decomposition of a matrix. We start by defining such a decomposition. 

Lemma 40 (Interpolative Decomposition (ID) - Lemma 1 in [97]). Let A E W^^"- 
and k < min{m, n}; then, there is a matrix C G M™^^ containing columns of A and 
a matrix X G M'^^" such that 

1. some subset of the columns of~K. makes up the Ik- 

2. no entry of X has absolute value greater than one. 

3- ||X||2 < pi{k, n), with pi{k, n) = \Jk{n — k) + 1. 

4- the k-th singular value o/X is at least one. 

5. if k = m or k = n, then A = CX. 

6. if k < m,n, then: || A — CX||2 < P2{k,n)\\A — Ak\\2', P2{k,n) = ^yk{n — k) + 1. 

Notice that an ID is similar with the decompositions studied in this thesis 
with the only difference being the matrix X; we typically choose X = C~'^A but 
this choice might not be numerically stable when the singular values of C are very 
small. An ID introduces several properties for the matrix X making sure that the 
approximation A ^ CX is numerically stable. An algorithm for computing such 
an ID is in observation 3.3 of [118] (All [72, 97, 118, 73] study interpolative decom- 
positions as defined in Lemma 40); this algorithm runs in 0{kmnlog{n)) time and 
computes a factorization that slightly sacrifices properties (2), (3), and (6) from 
the above definition. More specifically, it computes C,X such that: in (2), no en- 
try of X has an absolute value greater than 2; in (3), pi{k,n) = Akin — A;) + 1; 
similarly, in (6), p2{k,n) = ^jAk{n — k) + 1. A quick modification of the algo- 
rithm of Theorem 37 (see the proof for the details of the algorithms) gives a novel 
O {mnk \og{k~^ min(m, n)) + nk'^ log(n)) time randomized algorithm: 

1: Via Lemma 7 of section 2.2, let Z = FastSpectralSVD{A, k, 0.5); 
2: Via Lemma 16 of section 3.1.5, let Q = RRQRSampling{Z, k); 
3: Return C = Afi and X = (Z^fi)+Z^. 
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Analysis. Without loss of generality, assume that Q samples the first k columns 
of Z^. Let Z = [Zi;Z2], with Zi G R^""^ and Z2 G M("-'^)><^ From Lemma 15, 
Z'^n = Q[Afc, Bfc]; so Z^ = QAfc and Zj = QB^. By using this notation, 

X= [(QAfe)+(QAfc),(QAfc)+(QBfe)] = [I,,A+Bfc]. 

Now, we comment on all six properties of the above definition for the output C, X 
from our method. 

1. In Lemma 40, (1) is obviously satisfied by the above choice of X. 

2. Theorem 3.2 of [66], with / = 2, shows that no entry in X has an absolute 
value greater than 2. 



3. From Lemma 16: ||X||2 = || (Z^fi)+Z^||2 < ||(Z^n)+||2 < pi{k,n)\\{Z^ 
,/4k{n -k) + 1^^ = {^4k{n-k) + 1)1 = ^^kin - /e) + 1. 

4. (Jmin.(X) > 1 because 

a™„(X) = afc((ZTn)+ZT) = a^{{T7^Y ^ 
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and the fact that (Jmax^T^^ < o'maxi'Z''^) = 1, by the interlacing property of 
the singular values. 

5. Via Lemma 7, we computed a factorization A = AZZ"'" + E. Lemma 56 argues 
that rank(Z) = k with probability one. Also, since in this case A = A^, 
E = 0„ 

xn w.p. 1. Plug this factorization into Lemma 5 and construct the 
matrix W as we described in Theorem 37. Theorem 7.2 in [66] gives that 
rank(Z'^W) = k. So, || A-CC+A||| = 0, which also means that || A-CX|||, = 
0, i.e. A = ex. 

6. Property (6) is satisfied with p2 = A-\/Ak{n — k) + 1] to see this, notice that 
the bound for ||A — CC^A||2 also holds for ||A — CX||2 because this is the 
way we proved Lemma 5. 
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Proofs 

Proof of Theorem 32 

1: Via the SVD compute the matrices Vfc, Vp_fc of the right singular vectors of A. 

2: Via Lemma 19 of section 3.1.7, let [Q, S] = BarrierSamplingII(Vk,^p-k,f)- 

3: Return C = AQS with rescaled columns of A. 
Lemma 19 guarantees that (Tfc(V^r2S) > 1 — \fkjr > (assuming r > k), and so 
rank(V^fiS) = k. Also, ai(V^^^nS) = || Vj^^.f^SH^ < 1 + ^/{p-k)/r. Applying 
Lemma 6, we get 

l|A-n^,,,(A)||^ < 

< 

< 
< 



where the last inequality follows because ||Sp_fc||2 = ||A — Afe||2 and ||(V^fiS)"^||2 = 
l/(Tfc(V^fiS) < 1/(1 - v^)- Theorem 32 now follows by taking square roots of 
both sides. The running time is equal to the time needed to compute V^ and Vp-k 
(denoted as T(Vfc, Vp^k) plus the running time of the algorithm in Lemma 19. ■ 

A faster spectral norm deterministic algorithm. Our next theorem describes 
a deterministic algorithm for spectral norm reconstruction that only needs to com- 
pute Vfc and will serve as a prequel to the proof of Theorem 33. The accuracy 
guarantee of this theorem is essentially identical to the one in Theorem 32, with 
p — k being replaced by n. 

Theorem 41. Given A G M"^^" of rank p, a target rank k < p, and an oversampling 
parameter r > k, there exists a T(V k)+0{nrk^) deterministic algorithm to construct 
C e M'"^^ such that 

IIA - nUA)|b < (l + i±^) ||A - A.II. = o[^HA- A,||, 





- Afcl 


|A 


- Afcl 


|A 


- Afcl 


|A 


-Afcl 


|A 


-Afcl 



||(A-Afc)fiS||^i|(Vri^S)+||^ 
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ls....ii^iivL,i^s||^||(vrfiS)+||^ 
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Proof. The proof is very similar to the proof of Theorem 32, so we only highlight the 
differences. We first give the algorithm. 

1: Via the SVD compute the matrix of the right singular vectors of A. 

2: Via Lemma 19 of section 3.1.7, let [Q, S] = BarrierSamplingII{Vk,ln,i^)- 

3: Return C = AQS with rescaled columns of A. 

Lemma 19 guarantees that ||I„S||2 < 1 + ^/n/r. We now replicate the proof of The- 
orem 32 up to the point where ||(A — A;j)f2S(V^r2S)"'"||2 is bounded. We continue 
as follows: 



||(A - Ak)nsivlns)^\\l < ||(A - Ak)\\l\\inns\\l\\ivjns) 




The remainder of the proof now follows the same line as in Theorem 32. To ana- 
lyze the running time of the proposed algorithm, we need to look more closely at 
Lemma 62 and the related Algorithm 12 and note that since one set of input vectors 
consists of the standard basis vectors. Algorithm 12 runs in 0{nrk'^) time (see the 
discussion of the running time for this algorithm). The total running time is the 
time needed to compute V^ plus 0(nr/c^). ■ 

Proof of Theorem 33 

1: Via Lemma 7 of Section 2.2, let Z = FastSpectralSV D{A, k, 1). 

2: Via Lemma 19 of section 3.1.7, let [Q, S] = BarrierSamplingII{Z,In,r) 

3: Return C = AQS with rescaled columns of A. 
In order to prove Theorem 33 we will follow the proof of Theorem 32 using Lemma 7 
(a fast matrix factorization) instead of Lemma 6 (the exact SVD of A). More 
specifically, instead of using the top k right singular vectors of A (the matrix V^), 
we use the matrix Z G M"^'^ of Lemma 7. The proof of Theorem 33 is now identical 
to the proof of Theorem 41, except for using Lemma 5 instead of Lemma 6 in the 
first step of the proof: 

||A - Uli^{A)\\l < \\E\\l + ||EfiS(Z^fiS)+||^ = \\E\\l + ||EI„fiS(Z^fiS)+||^ 
< ||E||^ (1 + ||I„fiS||^||(Z^nS)+||^) , 
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where E is the residual error from the matrix factorization of Lemma 7. Taking 
square roots of both sides, we get 

\\A-Ul,iA)\\^ < ||Emi + ||I„fiS|y|(ZTl]S)+||J. 

We can now use the bounds guaranteed by Lemma 19 for ||InS||2 and || (Z"'"S)'^||2, 
to obtain a bound in terms of ||E||2. Finally, since E is a random variable, taking 
expectations and applying the bound of Lemma 7 concludes the proof of the theorem. 
The overall running time is equal to the time needed to compute the matrix Z from 
Lemma 7 plus an additional 0{nrk'^) time as in Theorem 41. ■ 

Proof of Theorem 34 

1: Via the SVD compute the matrix of the right singular vectors of A. 
2: Via Lemma 20 of section 3.1.8, let S] = BarrierSamplingllliY A — 
AVfcV^,r). 

3: Return C = Af2S with rescaled columns of A. 

We follow the proof of Theorem 32 in the previous section up to the point where we 
need to bound the term ||(A — Afc)f2S(V^f2S)"'"||^. By spectral submultiplicativity, 

||(A - A,)nS(V^nS)+||^ < ||(A - A,)fiS|||||(V^fiS)+||i 

To conclude, we apply Lemma 20 of Section 3.1.8 to bound the two terms in the right- 
hand side of the above inequality and take square roots on the resulting equation. 
The running time of the proposed algorithm is equal to the time needed to compute 
Vfe plus the time needed to compute A — AV^V^ (which is equal to 0{mnk) given 
Vfe), plus the time needed to run the algorithm of Lemma 20, which is equal to 
O {nrk"^ + mn). ■ 

Proof of Theorem 35 

We will follow the proof of Theorem 34, but, as with the proof of Theo- 
rem 33, instead of using the top k right singular vectors of A (the matrix V^), 
we will use the matrix Z of Lemma 8 that is computed via a fast factorization. 
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1: Via Lemma 8 of section 2.2, let Z = FastFrobemusSVD{A, k, 0.1); 

2: Via Lemma 20 of section 3.1.8, let [fi, S] = BarrierSamplingIII(Z, A— AZZ""", r). 

3: Return C = AQS with rescaled columns of A. 
The proof of Theorem 35 is now identical to the proof of Theorem 34, except for 
using Lemma 5 instead of Lemma 6. Ultimately, we obtain 

||A-n^_fc(A)||| < ||E||^+ ||E(]S(ZTnS)+||2 

< ||E||^+ ||El]S||^||(Z^fiS)+||^ 

< (l+(l-V^)"')l|E|||. 

The last inequality follows from the bounds of Lemma 20. The theorem now fol- 
lows by taking the expectation of both sides, using Lemma 8 to bound E[||E||p], 
taking squares roots on both sides of the resulting equation, and using Holder's 
inequality (i.e Lemma 58 with x = ||A - ng;,(A)|||,) to get E [||A - ng;.(A)||i.] < 
y^E [II A — IIq ^(A)|||,] . The overall running time is derived by replacing the time 
needed to compute V^ in Theorem 34 with the time needed to compute the fast 
approximate factorization of Lemma 8. ■ 

Proof of Theorem 36 

1: Via Lemma 8 of section 2.2, let Z = FastFrobemusSVD{A, k, 0.1). 
2: Via Lemma 20 of section 3.1.8, let [fi, S] = BarrierSamplingIII{Z, A— AZZ""", 4/c). 
3: Via the technique of section 3.1.2, let C2 = AdaptiveSampling{A, AQ, r — Ak); 
4: Return C containing the columns of both AQS and C2 

First, let f = 4k and 

Co = (1 + 0.1) I 1 + 1 = ^. 

Notice that the first two steps of the algorithm correspond to running the algorithm 
of Theorem 35 to sample f = Ak columns of A and form the matrix Ci = Af2S. 
The third step corresponds to running the adaptive sampling algorithm of Lemma 10 
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with B = A — CiCj^A and sampling a further s = [r — 4/c] columns of A to form 
the matrix C2. Let C = [Ci C2] G M™^^' contain all the sampled columns. We will 
analyze the expectation E || A — n^^^(A)||^ . 

Using the bound of Lemma 10, we first compute the expectation with respect 
to C2 conditioned on Ci: 



A-nS.(A)||; C, <||A-A,||| + -||B 



We now compute the expectation with respect to Ci (only B depends on Ci): 

Ec, [Ec, [||A - n^,,(A)||^| Ci]] < 1|A - A,||^ + ^Ec, [1|A - CiC+A|ll] ■ (4.1) 

By the law of iterated expectation, the left hand side is exactly equal to the term 
E [II A — nQ;j(A)|||.]. We now use the accuracy guarantee of Theorem 35 and our 
choice of Cq: 



Ec, [||A - CiC+A||^] < Ec, [||A - n^„.(A)|||] < co||A - A 



2 

k Ho- 



using this bound in (4.1), we obtain: 



E[||A-nS,.(A)|||] <(l + ^) ||A-A,| 



Using Co < 6 and our choice of s gives: 



E[||A-n^,,(A)|||] < 1 



6k 



r — Ak 



|A-A,||^. 



Taking square roots on both sides of this equation: 



EO|A-ng^,(A)|||] <./l + 



Qk 



r — Ak 



lA - A, 



kWF- 



To wrap up, use Holder's inequality as: 



E [l|A - nS,,(A)||^] < Je [II a - iil,mi] 
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The time needed to compute the matrix C is the sum of two terms: the running 
time of Theorem 35 (which is 0{mnk + nrk'^)), plus the time of Lemma 10 (which is 
0{mfmin{'m,n}+rnnf + n + slog{s))). The total run time 0{mnk + nk^ + n\og{r)) 
follows because f = 4k and s = r — 4k<r<n. ■ 

Proof of Theorem 37 

1: Using Lemma 7 in section 2.2, let Z = FastSpectralSV D{A, k, 0.5). 
2: Using Lemma 16 in section 3.1.5, let Q = RRQRSampling{Z, k). 
3: Return C = Ail. 

Lemma 16 implies that rank(Z"'") = rank(Z"'"f2) = k, so we can apply Lemma 5: 

||A - CC+A||^ < ||E||2 + \\EQ{Z^Q)+\\l 

By submultiplicativity, \\En{Z'^n)+\\l < \\E\\l\\n\\l\\{Z^n) + \\l. Since is a subset 
of a permutation matrix, W^^Wl = 1- By Lemma 16, ||(Z^r2) + ||2 < {4:k(r — k) + 
imZ^m Using ||(Z^)+||Hl: 

||A - CC+A||2 < ||E||2 + ||E||2(4A;(r - k) + 1). 

Taking square roots on the last expression: ||A-CC+A||2 < ||E||2(l+v/(4A;(r-/c) + l)). 
Finally, taking expectations over the randomness of E = A — AZZ^ and using 
Lemma 7 with e = 0.5 gives the result: 

||A - CC+AII2 < ||E||2(1 + V(4fc(r-fc) + l)) < 4^{4k{r-k) + l)\\A - A^h- 

The run time follows by the run time of Lemma 7 and Corollary 16. ■ 

Proof of Theorem 38 

1: Using Lemma 8 in Section 2.2, let Z = FastFrobemusSVD{A, k, 1/2). 

2: Using Lemma 20 of Section 3.1.8, let [f2i,S] = BarrierSamplingIII{Z, A — 

AZZ^,Ak). Define = Z^fiiS; so X e M^^^^ 
3: Via Lemma 16 of Section 3.1.5, let = RRQRSampling{X, k); ^2 e M^^^^\ 
4: Return C = AQiSQi- 
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Lemma 20 implies that rank(Z^r2iS) = rank(Z"'") = k; Lemma 16 implies that 
rank(X'^) = rank(X^l]2). Thus, letting W = l^iSfiz, rank(Z'^W) = k. So, we can 
apply Lemma 5: 

||A - CC+AIII < ||E||| + ||Ef]iSf]2(Z'^l^iSf]2)'^||F- 
By spectral submultiplicativity, 

||Ef]lSn2(Z^filSl]2)+||| < ||El]iS|||||n2||2ll(Z^^^lS^^2)+||2- 

By Lemma 20, ||EniS|||. < ||E|||,; since Q2 is a subset of a permutation matrix, 
11^2111 = 1. By Lemma 16, with = Z^^iS, ||(X^fi2)+||i < (4A;(4A; - k) + 



and again by Lemma 20, ||(X^) + ||2 < 1/(1 - ^/TJly. Combining all 
these results together and using the bound from Lemma 7 with e = 1/2, we have: 



(1 - v^)2 



< 53||E|||. 



Taking expectations on the later equation using Lemma 8 with e = 0.5: 

E [II A - CC+AIII] < 53(1 + 0.5)E [|| A - Afc|||] . 

Taking square roots on this equation and using Holder's inequality concludes the 
proof. Finally, the total run time follows by combining the run time of Lemma 8, 
Lemma 20, and Lemma 15, which are 0{mnk), 0{nk^ + mn), and 0{k^ log(A;)). ■ 

Proof of Theorem 39 

1: Via Lemma 8 of Section 2.2, let Z = FastFrobeniusSVD{A, k, 0.5); 



2: Via Lemma 13 of Section 3.1.4, let [^li, S] = RandomSampling(Z, l,8/cln(^)); 

Define X^ = Z^l^iS; so X G RSkin{f)xk_ 
3: Via Lemma 16 of Section 3.1.4, let 1^2 = RRQ RS ampling{yJ^) . 
4: Return C = AfiiSl]2. 
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Lemma 20 implies that w.p. at least 1—S rank(Z^r2iS) = rank(Z"'") = k; Lemma 16 
implies that rank(X^) = rank(X^fi2), for = Z^fiiS. Thus, letting W = f^iSl^s, 
rank(Z^W) = k. So, we can apply Lemma 5: 

||A - CC+A||| < ||E||| + ||Ef]iSl]2(Z^l^iSf]2)+||g. 

By spectral submultiplicativity and the fact that ||Ef2iSr2i < ||Er2iSf2i |||,: 

||EniS(]2(Z^l^lSfi2)+||? < ||El]iS|||||l]2||2ll(Z^^lSf^2)"^|l2- 

By Lemma 14, with Y = E and failure probability S: \\EQS\\p < |||E|||,, and 
since ^2 is a subset of a permutation matrix, 11^^211 2 = 1- By Lemma 16, with 
X^ = Z^QiS, 

||(X^fi2) + ||2 < {4:k{8k\n{2k/6) - k) + 1)||(X^) + ||^. 

Also, from Lemma 13, ||(Z^(]S) + ||^ < 1/0.3 w.p. at least 1 — 6. Combine all these 
results together, use ||E||| < ||E|||, and apply a union bound to get - so far - that 
w.p. at least 1 — 36, 

||A-CC+A||?<l??^!l«||Efr 

^ 6 

Using Lemma 8 with e = 0.5: E[||E|||.] < 1.5||A — Afc|||,. Applying Markov's 
inequality on the random variable x = ||E|||n, we conclude that w.p. 1—6: ||E|||n < 
1.5/(5||A — Afc|||,. Replacing this to the above equation and taking square roots on 
both sides we get that w.p. 1—46 



|A-CC-A||,<M1Z5|1A-A,||,. 



Finally, the run time of the algorithm is 0{mnk) + 0{n + klog{k/6) \og{klog{k/6))) 
+ 0{k^\n{k/6)), from the first, second, and third step, respectively. ■ 



CHAPTER 5 

CORESET CONSTRUCTION IN LEAST-SQUARES 

REGRESSION 



5.1 Coreset Construction with Deterministic Sampling 

Given A G R™^" (m ^ n) of rank p, b G M™, and V C M", the regression 
problem asks to find Xopt G V for which || Axopt — b||2 < || Ax — b||2, for all x G I'; 
the domain V represents the constraints on x and can be arbitrary. A coreset of size 
r < m is C = S"'"r2"'"A, be = S"'^f2"'"b, for some sampling and rescaling matrices G 
j^mxr^ S G W^'''. We assume that e > 0, which denotes the coreset approximation, is 
given as input. The goal is to construct a coreset of size r, with r being as small as 
possible and, ||Axopt — b||2 < (1 + e) || Axopf — b||2; Xopt = argmin ||S"'"fi"'"(Ax — b)||2. 

Theorem 42. Given A G M™^" of rank p = n, h E W^, and < e < |, Algorithm 6 
constructs Vt G M™^*" and S G W"^'' with r = [225(n + l)e~^] in time 0{mn^ + 
mn^ /e^) such that 5topt £ IR" obtained from S'^^l'^A, S'^^l'^h satisfies 

\\AScopt - b||^ < (1 + e) IIAxopt - b||^. 



Proof Let Y = [A, b] G 



pmx k 



^mx(n+i)^ and let its SVD: Y = UySyV^, where Uy G 
Sy G M^''^'^ and Vy G M("+i)><^ Here, k = rank(Y) < p + 1 = rank(A) + 1. 
Let r > p + 1 be a sampling parameter whose exact value will be specified later. Let 
[Q, S] = Barrier Sampling I (Uy^r). Let yi,y2 G M'^ defined as 



yi 



SyV 



y Vy 



^opt 



and 



y2 



^opt 



Note that Uyyi = Axop* — b, Uyy2 = Axopt — b, S^f2^Uyyi = S^f2^(Axopt — b 



^Portions of this chapter previously appeared as: C. Boutsidis and P. Drineas, Random Pro- 
jections for the Nonnegative Least Squares Problem, Linear Algebra and its Applications, 431(5- 
7):760-771, 2009, and as: C. Boutsidis, P. Drineas, M. Magdon-Ismail, Rich Coresets for Con- 
strained Linear Regression Manuscript, 2011. 
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Input: A e M*"^" of rank p = n, b G M™, and < e < 1/3. 
Output: n e M'"''^ S G W'' with r = [225(ri + l)e-2 ] . 

1: Compute the matrix Uy G M"^^'^' of the top k left singular vectors 

of Y = [A, b] G M™x("+i); A; = rank(Y) < rank(A) + 1 = p + 1. 
2: Let r = [225(n + l)e~2^. 

3: return [^2,8] = BarrierSamplingI{UY,r). (Corollary 18) 



Algorithm 6: Deterministic coreset for constrained regression, 
and S'^f2^UYy2 = S'^r2'^(Axopf— b). We need to bound ||UYy2|| in terms of ||UYyi||: 

l-\lfj ||UYy2||^<||S^(]^UYy2||^<^|S^fi^UYyi||^< (^1 + y^j ||UYyi||i 

(a) and (c) use Corollary 18; (b) follows because S^opt is optimal for the coreset 
regression. After reorganization, using k < p + 1, and assuming r > 9(p + 1), 



r 

l-^/^ 



2 



\A±opt - Hi < ^^7==^||Axopt - b||2 < ( 1 + 1 IIAxopt - b||2 



Setting r 



225(p+l) 



gives the bound in the Theorem. The overall running time 
is the sum of two terms: the time to compute Uy via the SVD and the time to run 
the Barrier Sampling I method on Uy (Lemma 18 with k < p + 1). m 

Theorem 42 improves on [Theorem 3.1, [47]] and [Theorem 5, [50]] which 
require coresets of size r = 3492n^ ln(3/5)/e^ and r = 0(nlog(n)/e^) to achieve 
(1 + e)-error w.p. 1 — 6 and 0.5, respectively. The coresets of [47, 50] are useful 
only for unconstrained regression. Both [47, 50] and our result use the SVD, so 
they are computationally comparable. It is interesting to note here that Algorithm 
6 at Theorem 42 applies the method of Section 3.1.6 to the left singular vectors of 
[A; b] G M™x("+i) whereas [47, 50] work with the left singular vectors of A G W""" 
(see Section 3.3). Next, by applying SubspaceSampling on the left singular vectors 
of Y we obtain a coreset with high probability and arbitrary constraints. 
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Input: A e M"*^" of rank p = n, b G M™, < 5 < 1, and < e < 1/3. 
Output: Q e M"'''^ S G W'' with r : 



36(n+l)ln(2(n.+l)/(5) 
^2 



1: Compute the matrix Uy G M™^'^ of the top k left singular vectors 
of Y = [A, b] G M™x("+i); A: = rank(Y) < rank(A) + 1 = p + 1. 



2: Let r 



36(n.+l) ln(2(n+l)/(5) 



3: return [^2,8] = SubspaceSampling{UY,^,r) (Lemma 13) 



Algorithm 7: Randomized coreset for constrained regression. 

5.2 Coreset Construction with Randomized Sampling 

The coreset construction algorithm of this section is similar with the one pre- 
sented in Section 5.1 with the only difference being the use of the randomized tech- 
nique of Section 3.1.4 instead of the deterministic technique of Section 3.1.6 that we 
used in Algorithm 6. Algorithm 7 finds a coreset with high probability for arbitrary 
constrained regression. Previous randomized coreset construction algorithms [47, 50] 
can not handle arbitrary constraints and succeed only with constant probability [50] . 

The algorithm of this section will serve as a prequel to the "coreset" con- 
struction algorithm that we will present in Section 5.3. The algorithm of Section 

5.3 is very similar with Algorithm 7 presented here, with the only difference being 
the fact that, before applying the Subspace Sampling method on the left singular 
vectors of the matrix containing both A and b, we will pre-multiply A and b with 
the Randomized Hadamard Transform that we presented in Section 3.1.10. The 
algorithm of Section 5.3 works for arbitrary constraint regression but succeeds only 
with constant probability, which is due to Lemma 25 of Section 3.1.10. 

Theorem 43. Given A G W""" of rank p = n, h e < S < 1, and < 
e < I, Algorithm 7 constructs Q G M™^", S G W'' with r = T M!l±iii£^MMi 
in time O (m?i^ + nln(n/(5) ■ ■ log(nln(n/(5)e~^)) such that Xopt obtained from 
S'^Jl'^A, S"'"f2'^b satisfies w.p. 1 — 6 

IjAxopt - b||2 < (1 + e) II Axopt - b||2; 
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Proof. Let Y = [A,b] e M™x(n+i)^ ^^^^ gyO: Y = UySyV^, where Uy E 

^mxfc^ Sy e R'''"' and Vy G M("+i)^^ Here k = rank(Y) < p + 1 = rank(A) + 1. 
For the failure probabihty 6 of the Theorem, let r > 4k\n{2k/6) be a sampling 
parameter whose exact value we will be specified later. Let 

[Q, S] = Subspace Sampling I {Uy, l,r). 

Let yi,y2 G M*^ defined as 



yi = SyVy 



^opt 



and y2 



'-opt 



Note that Uyyi = A^opt - b, Uyya = A±opt - b, S^f^^y^y^ = S^n^{A^opt - b), 
and S"'"r2"'"UYy2 = S"'"r2'^(Axop(— b). We need to bound ||UYy2|| in terms of ||UYyi||: 



1 - w'i^Sj ||u,,,,|^'|||s-«-Uvy.||^f lis^n-u^ 



Ak ln{2k/6)\ 
r j 



IIUYyill^. 



(a) and (c) use Lemma 13; (b) follows because ^opt is optimal for the coreset regres- 
sion. After reorganization, using < p + 1, and assuming r > 36A;ln(2(p + 



I Ax 



opt 



Hi < 1 1 + 3 



4(p + l)ln(2(p + l)/5) 



Setting 



36(p + l)ln(2(p+l)/5) 



gives the bound in the Theorem. The overall running time is the sum of two terms: 
the time to compute Uy via the SVD and the time to run the SubspaceSampling 
method on Uy (Lemma 13 with k < p + 1). ■ 
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Input: A e M"*^" of rank p = n, b G M™, < 5 < 1, and < e < 1/3. 
Output: Q e M"'''^ S G W''; r : 



72(n.+l) ln(2(n+l)/<5) log(40(n+l)m) 



1: Compute the matrix Uy G M™^'^ of the top k left singular vectors 
of Y = [A, b] G M™x("+i); h = rank(Y) < rank(A) + 1 = p + 1. 



2: Let r 



72(n.+l) ln(2(n+l)/(5) log(40(n+l)m) 



3: return [^2,8] = SuhspaceSampling{\]Y, 2iog(40fcm) ' ''^) (Lemma 13) 



Algorithm 8: Randomized "coreset" for constrained regression. 

5.3 "Coreset" Construction with the Hadamard Transform 

Recall the discussion in Section 3.1.10. First, notice that, since HD is a square 
orthonormal matrix, for every x, II Ax - b||^ = ||HDAx - HDb||^. So, it suffices to 
approximate a solution to the residual ||HDAx — HDbUg. In what follows, we essen- 
tially apply the randomized method of the previous section to the regression prob- 
lem involving HDA and HDb instead of A and b. Our algorithm here constructs 
coresets for the problem with HDA and HDb; unfortunately, this corresponds to 
"coresets" for the original problem involving A and b. The crux in the analysis of 
this section is that we do not need to compute the left singular vectors of the matrix 
Y = [A, b] G since, due to Lemma 25, after multiplying A and b with 

HD, the norms of the rows of the left singular vectors of Y are essentially known; so, 
applying Definition 12 and Lemma 13 with uniform probabilities qi = 1/m suffices 
to preserve the singular values of the sub-sampled matrix Uy, which is all we need 
to achieve in order to prove Theorem 43, the main result of this section. 



Theorem 44. Given A G M""''" of rank p = b G M™, < 5 < 0.95, and < e < 
|, Algorithm 8 constructs G M™''^ S G with r = 



72(n+l) ln(2(n+l)/(5) log(40(n+l)m) 
f2 



in time 

O (mralog ■ \n{n/6) ■ log(nm) ■ e~^)) 
such that 5topt ^ IR" obtained from S^Q^HDA, S"'"r2"'"HDb satisfies w.p. 0.95 — 5 



\A5copt - Hi < (1 + e) II Ax,pt - h\\l. 
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Proof. Let Y = [A,b] e M™x(n+i)^ g^^^ gyO: Y = UySyV^, where Uy G 



nmx fc 



Sy G M'^^'^ and Vy G M("+i)^^ Let Y = [HDA, HDb] = HDY G 



omx (n+1) 



and let its SVD: Y = UyJ^yV^, where Uy G 



pmxk 



, SyG 



,fcxfc ^j^^ ^ K(n+l)xfc_ 



It is important to note that Uy = HDUy, Sy = Sy, and Vy = Vy- Here k = 
rank(Y)p + 1 = rank(A) + 1. For the failure probability 6 of the Theorem, let 
r > Akln(2k/6) be a sampling parameter whose exact value we will be specified 
later. Let [fl,S] = SubspaceSampling I {Uy, 2iog(40fcm) ' ^) ^^"^ 



yi = EyVy 



^opt 



and 



y2 



^opt 



Note that UyYi = HDAx^pt - HDb, UYy2 = B.BA±opt - HDb, S^fi^UyYi 
S^fiT(HDAxopj - HDb), and S^fi^Uyya = S^l]T(HDAiopt - HDb). Now: 



i + ,/?^MHM)i2l(i2^!^)|||u,y.||i. 



(a) and (c) use Lemma 26; (b) follows because Xopt is optimal for the coreset regres- 
sion. After reorganization, using A; < and assuming r > 72A;ln(2A;/5) log(40/cm). 



iHDAiopt - HDb||2 < | 1 + 3 



8k\n{2k/5) log(40A;m) 



|HDAxopi-HDb||2. 



Setting 



72(p + 1) ln(2(p + 1)/S) log(40fcm) 



and removing HD from both sides gives the bound in the Theorem. The over- 
all running time is the sum of two terms: the time to compute Y = HD[A,b] 
(Lemma 24) and the time to run the SubspaceSampling method on Uy (Lemma 13 
with k < p + 1): O {mn log (n ■ ln(n/5) ■ log(?im) ■ e"^)) . ■ 



CHAPTER 6 

FEATURE SELECTION IN A^-MEANS CLUSTERING 

Consider m points V = {pi,P2, ■■■,Pm.} ^ "^^^ and integer k denoting the number 
of clusters. The objective of /c-means is to find a /c-partition of V such that points 
that are "close" to each other belong to the same cluster and points that are "far" 
from each other belong to different clusters. A /c-partition of P is a collection 
S = {Si,S2, ...,Sk} of k non-empty pairwise disjoint sets which covers V. Let 
Sj = \Sj\ be the size of Sj. For each set 5*^, let G M" be its centroid (the mean 
point): fij = {J2pieSjPi)/^i- "^^^ /c-means objective function is 

m 

J^{V,S) = J2\\P^-^^iP^)\\l 
1=1 

where i-i{pi) is the centroid of the cluster to which pi belongs. The goal of /c-means 
is to find a partition S which minimizes J-". 

For the remainder of this chapter, we will switch to a more convenient linear 
algebraic formulation of the /c-means clustering problem. Define the data matrix 
A G M"*^", which has the data points for its rows: A""" = [pi, . . . We represent 
a clustering S by its cluster indicator matrix X G M™^'^. Each column j = 1, . . . , k 
of X represents a cluster. Each row i = 1, . . . , m indicates the cluster membership 
of the point pi. So, Xjj = 1/y/Sj if and only if data point Pi is in cluster 5*^. Every 
row of X has exactly one non-zero element, corresponding to the cluster the data 
point belongs to. There are Sj non-zero elements in column j which indicates the 
data points belonging to cluster Sj. To see that the two formulations are equivalent, 

m m 

X) = II A - xx^Aii^ = ^ \\pj - p^x^Aii^ = hi - M(P.)^I|' = Hv, s). 

4 = 1 1=1 

^Portions of this chapter previously appeared as: C. Boutsidis, M.W. Mahoney and P. Drineas, 
Unsupervised Feature Selection for the fc-means Clustering Problem, Advances in Neural Informa- 
tion Processing Systems (NIPS), 2009, and as: C. Boutsidis, A. Zouzias and P. Drineas, Random 
Projections for fc- means Clustering, Advances in Neural Information Processing Systems (NIPS), 
2010. 
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After some elementary algebra, one can verify that for i = 1, m, p 

Using this formulation, the goal of fc-means is to find an indicator matrix X which 

minimizes ||A — XX"'"A||^. 

To evaluate the quality of different clusterings, without access to a "ground 
truth" partitioning (labels), we will use the fc- means objective function. Given 
some clustering X, we are interested in the ratio J-'(A, X)/J^(A, Xop^), where Xopt 
is the optimal clustering. The choice of evaluating a clustering this way is not new: 
all [112, 95, 77, 76, 57, 113, 6] provide results along the same lines. 

Below, we give the formal definitions of the /c-means problem and a /c-means 
approximation algorithm. Recall that our primarily goal in this thesis is to develop 
techniques that select features from the data; we do not design algorithms to cluster 
the data per se. A fc-means approximation algorithm is useful in our discussion since 
it will be used to evaluate the quality of the clusterings that can be obtained after 
our feature selection techniques, so we include this definition as well. 

Definition 45. [The k-means clustering problem] Given A e M™^" (repre- 
senting m points - rows - described with respect to n features - columns) and a posi- 
tive integer k denoting the number of clusters, find the indicator matrix Xopt G M™'^'^ 

Xopt = argmin ||A - XX^A||^. 

The optimal value of the k-means clustering objective is 

T{A, Xopt) = min || A - XX^A||| = || A - XoptXjp^A\\l = Topt- 

In the above, X denotes the set of all m x k indicator matrices X. 

Definition 46. [k-means approximation algorithm] An algorithm is a "7- 
approximation" for the k-means clustering problem ('y > 1) if it takes inputs A and 
k, and returns an indicator matrix X^ such that w.p. 1 — 6^: 

II A - X^X^AII^ < 7 min II A - XX^AHl = 7^(A, Xopt) = l^opt- 

An example of such an algorithm is [95] with 7 = l + e(0<e<l), and 8^ some 
constant in (0, 1). This method runs in 
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Input: Dataset A G M™^", number of clusters /c, and < e < |. 
Output: C G W^^'^' with r = 0{k\og{k)/e'^) rescaled features. 

1: Let Z = FastFrobemusSVD{A, k,e) (Lemma 8). 

2: Let r = Co ■ 4/c ln(200A;)/e^ (cq is a sufficiently large constant). 

3: Let [^,S] = SubspaceSampling{Z,l,r) (Section 3.1). 

4: Return C = AQS G M™-^^ with r rescaled columns from A. 

Algorithm 9: Randomized Feature Selection for A;-means Clustering. 

6.1 Feature Selection with Randomized Sampling 

Given A, k, and < e < 1/3, Algorithm 9 is our main algorithm for feature 
selection in fc-means clustering. In a nutshell, construct the matrix Z with the (ap- 
proximate) top-A; right singular vectors of A and select r = 0(A;log(A;)/e^) columns 
from Z""" with the randomized technique of Section 3.1.4. One can replace the first 
step in Algorithm 9 with the exact SVD of A; the result though is asymptotically 
the same as the one we will present in Theorem 47. Working with the approximate 
singular vectors Z gives a considerably faster algorithm. 

Theorem 47. Let A G M"*^" and k are inputs of the k-means clustering problem. 
Let e G (0,1/3) and, by using Algorithm 9 in 0{mnk/e + k\n.{k)/e^\og{khi{k)/e)) 
construct features C G R™^^ with r = 0{k\og{k)/e'^) . Run any '-/-approximation 
k-means algorithm on C, k and construct X^^. Then w.p. 0.21 — 6^: 

II A - X;^XTA||^ < (1 + (2 + e)7) || A - X^^tXlMl- 

To prove Theorem 47, we first need Lemma 48, which we prove in the Appendix. 

Lemma 48. Fix A, k, < e < 1/3 and < 5 < 1. Via Lemma 8, for some r, let 
[n, S] = SubspaceSampling{Z, 1, r). From Lemma 8: A = ATI27 + E. Then: 

1. For anyr>0 and w.p. 1 - 5: \\EnSS^ il'^ Z\\l < ^||E|||. 

2. Let r = 4k\n{2k/6) /e^ ; then w.p. 1 - 6: ||(ZTf]S)+ - (Z^fiS)T||2 < ^ 

3. ForsomeEe M'"^", let AZZ^ = A^]S(Z^^]S) + Z^+E; then, ifr = Ak\n{2k/6)/e^, 
w.p. 1-36: \\E\\f < ^||E||ir. 
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Proof, (of Theorem 47) We start by manipulating the term ||A — X;^X- A|||,. Re- 
placing A = BZ^ + E and using Matrix Pythagoras: 

II A - X^X^AIII = ||(I„ - X^X^)BZT||| + ||(I„, - X^X^)E||^ . (6.1) 

We first bound the second term of eqn. (6.1). Since Im — X^X? is a projector 
matrix, it can be dropped without increasing the Frobenius norm. From Lemma 8 
and Markov's inequality w.p. 0.99: ||E||| < (1 + 100e)||A - Afc|||,. Note also that 
XoptX^jA has rank at most k; so, overall, w.p. 0.99: 

9l < (l + 100e)||A-Afc|||< (l + 100e)||A-X,piXjp,A||| = (l + 100e)F„pi. 

We now bound the first term in eqn. (6.1): 



^3 



< ||(I^-X^X^)Af]S(ZTnS)+ZT||^+ ||E||^ (6.2) 



< ||(I^-X^XT)AfiS||^||(ZTfiS)+||2+||E||^ (6.3) 

'-opt) 



< V7||(I,„ - XopiXjp,)AfiS||i.||(ZTfiS) + ||2 + ||E||f (6.4) 



In eqn. (6.2), we used the third statement of Lemma 48, the triangle inequality, and 
the fact that 1^ — X^X^ is a projector matrix and can be dropped without increasing 
a unitarily invariant norm. In eqn. (6.3), we used spectral submultiplicativity and 
the fact that can be dropped without changing the spectral norm. In eqn. (6.4), 
we replaced X;^ by Xopt and the factor ^7 appeared in the first term. To better 
understand this step, notice that X;^, gives a 7- approximation to the optimal /c-means 
clustering of the matrix C = Af2S, so any other m x k indicator matrix (e.g. Xopj): 

II (I™ - X^X^) AnSWl < 7min ||(I™-XXT)A1]S||| < 7II (im - X^pXpt) Af]S|||. 

By using Lemma 14 with 6 = 3/4, Lemma 13, and the union bound on these two 
probabilistic events, w.p. 1 — j — 6: 



||(I™-X„,tX:,,)AfiS||^||(Z^fiS)+||2 < \lj^/opt. 
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We are now in position to bound 63; set 6 = 0.01. Assuming 1 < 7: 



4 1.6eVl + lOOe 



j V7v^< (V2 + 94e 



93<\\l + 1 V^V^opt < ( V2 + 94e ) V7 Vi^opf 

This bound holds w.p. 1 — | — 3 ■ 0.01 — 0.01 due to the third statement of Lemma 48 
and the fact that ||E||| < (l + 100e)|| A- Afc|||. w.p. 0.99. The last inequality follows 
from our choice of e < 1/3 and elementary algebra. Taking squares on both sides: 

el < (2 + 9162e)^Fopt. 

Overall (assuming 7 > 1): 

||A-X^X^A||| < ej + ej < (2 + 9162e)7Fopt + (l + 100e)F,pi < F,pt + {2 + 10h)^Fopt. 

Rescaling e accordingly (cq = 10®) gives the bound in the Theorem. The failure 
probability follows by a union bound on all the probabilistic events involved in the 
proof of this theorem. Indeed, | + 3 ■ 0.01 + 0.01 + 5^ = 0.79 + 5^ is the overall failure 
probability. ■ 



Existence of 0{k/e^) "good" features 

Theorem 47 proved that 0(A;log(A;)/e^) features suffice to preserve the clus- 
tering structure of a dataset within a factor of 3 + e. The technique that we used to 
select the features breaks for r = o(A;log(A;)) (due to Lemma 13 in Section 3.1.4). Re- 
cently, in [17], by using more sophisticated methods (extensions of the techniques of 
Sections 3.1.7 and 3.1.8 to sample columns from multiple matrices simultaneously), 
we proved the existence of 0{k/e^) features with approximation 8 + e. Unhappily, to 
find these features, the (deterministic) algorithm in [17] requires knowledge of the 
optimum partition Xopf, which is not realistic in any practical application. On the 
positive side, we managed to get a (randomized) algorithmic version with 0{k/e^) 
features but this gives only a 0(log(A;)) approximation. An interesting open question 
is whether one can get constructively (1 + e) or some constant-factor approximation 
with 0{k/e^) or even 0{k/e) features (hopefully with a "small" hidden constant). 
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Input: Dataset A G M™^", number of clusters /c, and < e < |. 
Output: C G R""""'' with r = 0{k/e^) artificial features. 

1: Set r = Cok/e^ for a sufficiently large constant Cq. 
2: Compute a random n x r matrix i? as follows. For all ^ = 1, n, 
J = l,...,r (i.i.d) 

^ ^ f + l/v^,w.p. 1/2, 
\-l/v/f,w.p. 1/2. 

3: Compute C = AR with the Mailman Algorithm (see text). 
4: Return C G M"''''. 

Algorithm 10: Randomized Feature Extraction for /c-means Clustering. 

6.2 Feature Extraction with Random Projections 

We prove that any set of m points in n dimensions (rows in a matrix A G 
M™^") can be projected into r = 0{k/e^) dimensions, for any e G (0,1/3), in 
0{mn\e~'^k/ log{n)'\) time, such that, with constant probability, the optimal k- 
partition of the points is preserved within a factor of 2 + e. The projection is done 
by post-multiplying A with an n x r random matrix R having entries +l/-\/r or 
— l/-\/r with equal probability. More specifically, on input A, k, and e, we construct 
C G R"''''" with Algorithm 10. 

Running time. The algorithm needs 0{mk/e^) time to generate R; then, the 
product AR can be naively computed in 0{mnk/e^). One though can employ the 
so-called mailman algorithm for matrix multiplication [98] and compute the prod- 
uct AR in 0(m?T,[e~^A;/log(n)] ). Indeed, the mailman algorithm computes (after 
preprocessing) a matrix- vector product of any ra-dimensional vector (row of A) with 
an n X log(n) sign matrix in 0{n) time. Reading the input n x logn sign matrix 
requires O(nlogn) time. However, in our case we only consider multiplication with 
a random sign matrix, therefore we can avoid the preprocessing step by directly 
computing a random correspondence matrix as discussed in [98, Preprocessing Sec- 
tion]. By partitioning the columns of our n x r matrix R into [r/log(?7,)] blocks, 
the claim follows. 



93 



Analysis. Theorem 49 is our quality-of-approximation-result regarding the 
clustering that can be obtained with the features returned from the above algorithm. 
Notice that if 7 = 1, the distortion is at most 2 + e, as advertised. If the 7- 
approximation algorithm is [95] the overall approximation factor would be (1 + (1 + 
e)^) with running time 0(mn[e~^/c/ log(n)] +2^^/''^°^^^mk/e'^). 

Theorem 49. Let A G M™-^" and k are inputs of the k-means clustering problem. 
Let e G (0, 1/3) and construct features C G M™-^'' with r = 0{k/e^) by using Algo- 
rithm 10 in 0{mn\e~'^k / \og{n)\) . Run any '-j- approximation k-means algorithm on 
C, k and construct X^^. Then w.p. 0.95 — 6^: 

II A - X;^XTA||^ < (1 + (1 + e)7) II A - X,,tXl,A\\l. 

Proof. We start by manipulating the term ||A — X^^X^AHI,. Replacing A = A^ + 
Ap_fc and using Matrix Pythagoras: 

||A-X^X^A||| = ||(I„-X^X^)Afc||| + \\il^-X^X'^)A,_k\\l. (6.5) 
^ V ' ^ V ' 

"1 "2 

We first bound the second term of Eqn. (6.5). Since 1^, — X^X- is a projector 
matrix, it can be dropped without increasing the Frobenius norm. So, by using this 
and the fact that XoptX^^A has rank at most k: 



el < WK^kWl =||A-Afe||^ < ||A-X„piXjp,A|||. (6.6) 
We now bound the first term of Eqn. (6.5): 

Oi < ||(I„,-X^X|)Ai?(Vfe/2)+V^||i. + ||E||i. (6.7) 

< ||(I^-X^X^)A/?||^||(Vfci?) + ||^ + ||E||,. (6.8) 

< V7ll(I„^-X„p,Xj,JAi?||^||(V,i?)+||^ + ||E||^ (6.9) 

< ^v/(rT^||(I„-X„,,Xj^JA||^^ + 46||(I„,-X„,,Xj,JA||^fi.lO) 

< V^(l + 2.5e)||(I„,-X,piXjpJA||^ +y^4e||(I„-X„piX:p,)A||^(6.11) 

< V7(l + 6.5e)||(I„,-X,pXpJA||^ (6.12) 
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Input: Dataset A G M™^", number of clusters k, and < e < 1. 
Output: C G W^^'' with k artificial features. 

1: Let Z = FastFrobeniusSVD{A,k,e) (Lemma 8). 
2: Return C = AZ G R"''"'. 

Algorithm 11: Randomized Feature Extraction for /c- means Clustering. 

In Eqn. (6.7), we used the sixth statement of Lemma 21, the triangle inequality 
for matrix norms, and the fact that 1^ — X^X^ is a projector matrix and can be 
dropped without increasing the Frobenius norm. In Eqn. (6.8), we used spectral sub- 
multiplicativity and the fact that can be dropped without changing the spectral 
norm. In Eqn. (6.9), we replaced X^y by Xopt and the factor ^/^ appeared in the 
first term. To better understand this step, notice that X;y gives a 7-approximation 
to the optimal /c- means clustering of the matrix C, and any other m x k indicator 
matrix (for example, the matrix Xop^) satisfies 

II (I„ - X^X^) C||^ < 7 min 11(1^ - XXT)C||| < 7II (l^ - ^opt^lt) C\\l. 

In Eqn. (6.10), we used the fifth statement of Lemma 21 with X = (I — XoptX^^JA, 
the first statement of Lemma 21 and the optimality of SVD. In Eqn. (6.11), we used 
the fact that 7 > 1 and that for any e G (0, 1/3) it is (a/1 + e)/(l - e) < 1 + 2.5e. 
Taking squares in Eqn. (6.12), we get 

el < 7(l + 28e)||(I„-X„,,Xj^,)A||^. 

Rescaling e accordingly gives the bound in the theorem. The failure probability 
0.05 + 6y follows by a union bound on all four probabilistic events involved in the 
proof of the theorem which are the 7-approximation /c-means algorithm with failure 
probability and the first, fourth, and sixth statements of Lemma 21 with failure 
probabilities 0.01, 0.01, and 0.03, respectively. ■ 
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6.3 Feature Extraction with Approximate SVD 

Finally, we present a feature extraction algorithm that employs the SVD to 
construct r = k artificial features. Our method and proof technique are the same 
with those of [40] with the only difference being the fact that we use a fast approx- 
imate randomized SVD from Lemma 8 as opposed to the expensive exact deter- 
ministic SVD. Our choice gives a considerably faster algorithm with approximation 
error similar to the approach in [40]. 

Theorem 50. Let A G M™-^"- and k are inputs of the k-means clustering problem. 
Let e G (0, 1) and construct features C G M"*^'^ by using Algorithm 11 in 0{mnk/e) 
time. Run any approximation k-means algorithm on C, k and construct X;^. Then 
w.p. 0.99 - 6^: 

||A - X;^X^A||| < (1 + (1 + e)7) ||A - Xopt'X.lMl- 

Proof. We start by manipulating the term || A — X^X?A|||n. Replacing A = BZ"'"-|-E 
and using Matrix Pythagoras: 

II A - X^X^AII^ = ||(I„ - X^X^)BZT||| + ||(I„ - X^X^)E||^ . (6.13) 

^ V ' ^ V ' 

el el 
In the proof of Theorem 47 we argued that w.p. 0.99: 

el<{l + 100e)||A - Afclll < (1 + 100e)||A - X^p^Xj^^AjH = (1 + ime)Fopt. 

We now bound the first term in eqn. (6.13): 

^3 < ||(I^-X;^X|)AZZT||^ (6.14) 

< ||(I^-X;^X^)AZ||^ (6.15) 

< v^||(I™-X„,,Xj^JAZ||^ (6.16) 

< v^||(I™-X,,,XjpJA||^ (6.17) 

In eqn. (6.14), we replaced B = AZ. In eqn. (6.15), we used spectral submulti- 
plicativity and the fact that ||Z"'"||2 = 1. In eqn. (6.16), we replaced X;y by Xopt 
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and the factor appeared in the first term (similar argument as in the proof of 
Theorem 47). In eqn. (6.17), we used spectral submultiplicativity and the fact that 
II Z II 2 = 1. Overall (assuming 7 > 1): 

II A - X^X^AIII < el + el < ^Fopt + (1 + 100e)F„pi < F^^t + (1 + lO^hF^pf 



Rescale e accordingly and use the union bound to wrap up. 



CHAPTER 7 
FUTURE DIRECTIONS 

This thesis presented a few aspects of the world of "Matrix Samphng Algorithms" . 
A popular problem in this line of research is low-rank approximations of matrices 
by using a small subset of their columns: given A and r, among all (") matrices C 
with r columns from A find one with 

C* = argmiuc ||A - CC+A|||. 

We learned various approaches, from Additive-error algorithms as in Section 3.1.1 to 
more sophisticated greedy techniques as in Chapter 4. We argued that the predom- 
inant goal of a "Matrix Sampling Algorithm" is to preserve the spectral structure 
of the top right singular vectors of the matrix (Sections 3.1.4 to 3.1.8). 

In another - parallel - line of research^, "Sparse Approximation Algorithms" 
are used for the following combinatorial problem: given A G M™^", b G M™, and 
r < n, among all (") vectors G M" with at most r non-zero entries, find one with 

x; = argmin|,^||^^<^ || Ax - b||2. 

In words, this problem asks to find a subset of columns from A (those columns that 

correspond to the non-zero elements in x*), such that the solution vector obtained 

by using these columns is the best possible among all (") such solutions. 

Notice that, in both problems, the goal is to find the "best" columns from 

the input matrix. What makes the problems different is the objective function that 

these columns attempt to minimize. So, arguably, the problems from these two 

communities are not "far" from each other. Unfortunately, it appears that there 

is little, if no, overlap in algorithmic approaches. An interesting avenue for future 

research is to explore the applicability of "Matrix Sampling Algorithms" for sparse 

approximation problems and vice versa. For example, 

^ "Topics in Sparse Approximations" from Joel Tropp [136] and "Topics in Compressed Sensing" 
from Deanna Needell [110] give a nice overview of this area. 



97 



98 



1. Are "Matrix Sampling Algorithms", such as those we presented in Section 3.1, 
useful in the context of sparse approximations? 

2. Are "Sparse Approximation Algorithms", such as, for example, the orthogonal 
matching pursuit and the basis pursuit [135], useful in the context of low-rank 
column-based matrix approximations? 

Towards this end, in [15] we made a little progress by leveraging the randomized 
technique of Section 3.1.4 and the deterministic technique of Section 3.1.7 to design 
novel sparse approximation algorithms. We now present the result of [15]. 

Theorem 51. Fix A G R™^", b G M™, and target rank < k < n. For r > 
lAAk ln{20k) , let [Qi,Si] = SubspaceSamphng{Vk,l,r) , and let G M" is the 
r-sparse vector obtained from AQi, h. Then, with constant probability: 

n**+, ,n /36fcln(20A;) 11 A - AJI^,,, „ 

Ax, - b 2 < AA+b - b 2 + \ ^ -r^ b 2. 

For r > 3Qk, let [^2,82] = BarrierSamplingIII(Vk, {-^ — ■^^k^J)'^ and let 
Xr G M" is the r-sparse vector obtained from AQ2, b. Then, 

IIAx. - b||2 < IIAA.+b - b||2 + (1 + l/f )^^=(^l|b|l2- 

Notice that the solution vectors obtained by our algorithms are evaluated with 
respect to the solution vector obtained by the so-called Truncated SVD Regular- 
ized approach, = A^b [74]. This is certainly interesting because the theorem 
essentially says that one can replace the fc-SVD solution with a sparse solution with 
almost k non-zero entries and get a comparable (up to an additive error term) per- 
formance. It would have been nice though to obtain bounds with respect to the 
optimum solution of the regression problem, Xopt = A^b. Although this is a much 
more harder bound to obtain, we believe that it is not quixotic to seek algorithms 
with such properties. To obtain such bounds, it might require the development 
of new greedy-like methods (such as those of Sections 3.1.6, 3.1.7, and 3.1.8) that 
attempt to optimize directly the objective of the underlying sparse approximation 
problem. 
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Low-rank Column-based Matrix Approximations. Consider approximations 
for tlie equation: || A— ^.(A)||^ < a|| A— Afc||g. For ^ = F and r = k, the problem is 
closed, modulo running time, since [34] provides a deterministic algorithm matching 
the known lower bound a = a = \/k + 1. For = 2 and r = k, current algorithms 
are optimal up to a factor 0{k'^). [66] provides a deterministic algorithm with 
a = 0{y/kn); the lower bound though indicates that a = \fnjk might be possible, 
so there is some hope to improve on this result. For ^ = 2 and r > k, the problem is 
closed, modulo running time and constants. We provided a deterministic algorithm 
with a = 0{\Jn/r) and the lower bound we proved is d = ^Jn/r. For = 2 and 
r > k, we provided a randomized algorithm which is optimal up to a constant 20. 
It would be nice though to design a deterministic algorithm with a better constant. 
[68] provides a deterministic algorithm with approximation: 



It would be nice to understand whether this result can be extended to the rank k 
matrix 11^^ (A) as well. Moreover, a faster algorithm than [68] should be possible. 
We believe that there exists such an algorithm running in SVD time. 

Coreset Construction in Least-Squares Regression. We believe that the 
deterministic algorithm that we presented in Section 6.1 is optimal up to a factor K 
It is known that relative error coresets of size 0{k) are not possible. Relative error 
coresets of size 0{k/e) should be possible. We hope such algorithms will become 
available soon; if not, it would be nice to develop lower bounds for this problem. 

Feature Selection in fc-means. The feature selection algorithm of Section 6.1 
gives a constant factor approximation to the optimal partition. We believe that 
relative error approximations should be possible. Further, a deterministic approach 
would get rid of the log(fc) factor from the number of sampled features, which is 
there due to Coupon Collector issues. We hope that deterministic relative-error 
algorithms with 0{k/e) features will become available soon; if not, it would be nice 
to understand the limitations of this problem by developing lower bounds. 
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APPENDIX A 
TECHNICAL PROOFS 



A.l Proof of Lemma 4 

Our proof for the Frobenius norm case is a mild modification of the proof of 
Lemma 4.3 of [30]. First, note that ng_^(A) = H^^^- (A), because Q G M'"^"^ is an 
orthonormal basis for the column space of C. Thus, 

l|A - nS,,(A)||| = ||A - ng,,(A)||^ = min ||A - Qnl- 

*;rank('i')<fc 

Now, using matrix-Pythagoras and the orthonormality of Q, 

||A - Q^lll = ||A - QQ^A + Q(QTA - = ||A - QQ^A||| + HQ^A - ^|||. 

Setting ^ = (Q^A)fc minimizes the above quantity over all rank-fc matrices 
Thus, combining the above results, ||A — n^;,(A)|||. = ||A — Q (Q"'"A)^, |||,. 

We now proceed to the spectral-norm part of the proof, which combines ideas 
from Theorem 9.3 of [73] and matrix-Pythagoras. Consider the following derivations: 

||A-Q(QTA)J|2 = ||A-QQTA + Q(QTA-(QTA)fe) ||2 

< ||A-QQTA||2+||QQTA-(QQTA)fc||2 

< ||A-n2Q,,(A)i|^ + ||A-A,||^ 



< 



2||A-n^,,(A)||^. 



The first inequality follows from the simple fact that (QQ^'^A)^ = Q (Q"'"A)^ and 
matrix-Pythagoras; the first term in (a) follows because QQ^A is the uncon- 
strained, not necessarily of rank at most k) best approximation to A in the column 
space of Q; the second term in (a) follows because QQ""" is a projector matrix, so: 



2 



IIQQ^ A - (QQ^ A)fc||^ = at^,{qq'A) < ai^,{A) = \\A- A^h^ 
The last inequality follows because ||A — Afc||2 < ||A — nQj.(A)||2. 
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A. 2 Proof of Lemma 5 

The optimality of 11^ ^(A) implies that 

l|A-4,,(A)|||< ||A-X||| 

over all matrices X G M™^" of rank at most k in the column space of C. Consider 
the matrix 

x = c(zTw)+zT. 

Clearly X is in the column space of C and rank(X) < k because Z G M"^^. Proceed 
as follows: 

||A- C(Z^W)+Z^||? = ||BZ^+ (A-BZ^)-(BZ^ + (A-BZ^))W(Z^W)+Z^ 

V ' ^ V ' 

A C=AW 

= ||BZ^ - BZ^W(Z^W) + Z^ + E + EW(Z^W) + Z^||^ 
||E + EW(ZTW)+ZT||| 

< ||E||^+ ||EW(Z^W) + Z^||^. 
(a) follows because, by assumption, 

rank(Z^W) = k, 



and thus 



which implies 



:zTw)(zTw)+ = ifc, 



BZ^ - B(Z^ W)(Z^W) + Z^ = 
(6) follows by matrix-Pythagoras because 



EW(ZTW)+ZTE^ = 0™x„ 



(recall that E = A — BZ"^ and EZ = Omxk by assumption). The lemma follows by 
spectral submultiplicativity because Z has orthonormal columns, hence ||Z||2 = 1. 
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A. 3 Proof of Lemma 7 

Consider the following algorithm, described in Corollary 10.10 of [73]. The 
algorithm takes as inputs a matrix A G R™^" of rank p, an integer 2 < k < p, 
an integer q > 1, and an integer p > 2. Set r = k + p and construct the matrix 

Y e M'"^^ as follows: 

1. Generate an n x r standard Gaussian matrix R whose entries are i.i.d. A/'(0, 1) 
variables. 

2. Return Y = (AA^)''AR e M™^^ 

The running time of the above algorithm is 0{mnrq). Corollary 10.10 of [73] presents 
the following bound: 

E[||A-YY+A||2] < (^1 + '''^^ V^ui{m,n} - fcj ||A - A^lh, 

where e = 2.718 . . .. This result is not immediately applicable to the construction 
of a factorization of the form A = BZ""" + E with Z G M"^'^; the problem is with the 
dimension of Y G K^^^, since t ik^. Lemma 60 below, which strengthens Corollary 
10.10 in [73], argues that the matrix Uy f.{A) "contains" the desired factorization 
BZ""". Recall that while we cannot compute Hy f,(A) efficiently, we can compute 
a constant-factor approximation, which is sufficient for our purposes. The proof 
of Lemma 60 is very similar to the proof of Corollary 10.10 of [73], with the only 
difference being our starting point: instead of using Theorem 9.1 of [73] we use 
Lemma 6 of our work. To prove Lemma 60, we will need several results for standard 
Gaussian matrices, projection matrices, and Holder's inequality. The following seven 
lemmas are all borrowed from [73]. 

^We should note that [73] contains a result that is directly applicable for a factorization BZ^ 
with Z G R"^'^; this is in eqn. (1.11) in [73]; one can use this result and get an approximation 
2 + e in Lemma 7; the new analysis improves that to a/2 + e. Also, it is possible to use the matrix 

Y to compute such a factorization since Z can have more than k columns. This will sacrifice the 
approximation bounds of our algorithms; for example, assume that we construct a matrix Z with 
2k columns; then, one needs to sample at least r > 2k columns to guarantee the rank assumption 
and the error bound, for example, in Theorem 34 will become 1 + 1/(1 — ^y2k/r). Note that such 
a Z can be computed by applying the algorithm on A''"; then Y = Z. Orthonormalizing Z is also 
not necessary (this requires a slightly different argument in Lemma 5 that we do not discuss here). 
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Lemma 52 (Proposition 10.1 in [73]). Fix matrices X, Y, and draw a standard 
Gaussian matrix R of appropriate dimensions. Then, 

E[||XRY||2] < ||X||2||Y||f+ ||X||p||Y||2. 

Lemma 53 (Proposition 10.2 in [73]). For k,p > 2, draw a standard Gaussian 
matrix R G Then, 

E [IIR-IIJ < 

where e = 2.718 . . .. 

Lemma 54 (Proposition 10.1 in [73]). Fix matrices X, Y, and a standard Gaussian 
matrix R of appropriate dimensions. Then, 

E [||XRY|||] = ||X|||||F|||. 

Lemma 55 (Proposition 10.2 in [73]). For k,p > 2, draw a standard Gaussian 
matrix R G R'^xC'+p). Then, 

Lemma 56 (proved in [73]). For integers k,p > 1, and a standard Gaussian R G 
^kx{k+p) ^^j^^ of a is equal to k with probability one. 

Lemma 57 (Proposition 8.6 in [73]). Let P be a projection matrix. For any matrix 
X of appropriate dimensions and an integer q > 0, 

||px||2 < {\\F{xx^yx\\2)^ 

Lemma 58 (Holder's inequality). Let x be a positive random variable. Then, for 
any h > 1, 

E[x] < (E [x^]y . 
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The following lemma provides an alternative definition for 11^^ (A) which will be 
useful in subsequent proofs. Recall from Section 2.1 that we can write nQ;.(A) = 
CX^, where 

= argmin ||A - C^|||. 

^giRrxn.rank('i')<fc 

The next lemma basically says that 11^ ,^.(A) is the projection of A onto the rank-/c 
subspace spanned by CX^, and that no other subspace in the column space of C is 
better. 

Lemma 59. For A e W'''' and C G M'"'''', mteger r > k, let Ill ,^{A) = CX«, and 
Y G R*"^" be any matrix of rank at most k. Then, 

||A - CX^Wl = ||A - (CX«)(CX«)+A||| < ||A - (CY)(CY)+A|||, 

where Y eW^"^ is any matrix of rank at most k. 

Proof. The second inequality will follow from the optimality of X^ because Y(CY)"'' A 
has rank at most k. So we only need to prove the first equality. Again, by the 
optimality of X^ and because X^(CX^)+A has rank at most k, ||A — CX^||| < 
||A-(CX«)(CX«)+A|||. To get the reverse inequality, we will use matrix-Pythagoras 
as follows: 



|A-CX«||| = II (I„-(CX«)(CX«)+) A-CX«(I„-(CX«)+A)||| 
> II (I„-(CX«)(CX«)+) A|||. 



Lemma 60 (Extension of Corollary 10.10 of [73]). Let A be a matrix in M™^" of 
rank p, let k be an integer satisfying 2 < k < p, and let r = k + p for some integer 
p > 2. Let R G M"^'' be a standard Gaussian matrix (i.e., a matrix whose entries are 
drawn in i.i.d. trials from Af{0, 1)). Define B = (AA"'")^A and compute Y = BR. 
Then, for any q > 0, 

E[||A-n|,,(A)||2] < (l + + "'^^^ Vmin{m,r2} - k) HA-A^lh 



115 



Proof. Let n^^^(A) = YXi and n^^^(B) = YX2, where Xi is optimal for A and X2 
for B. From Lemma 59, 

l|A - n|,,(A)|h = - (YXO(YXi)+)A|h < - (YX2)(YX2)+)A|h. 

From Lemma 57 and using the fact that Im — (YX2)(YX2)''' is a projection, 

T^i^ A II 2^ 



- (YX2)(YX2)+)A||2 < II {Im - (YX2) (YX2)+) (AA^)" A 



12 



|B- (YX2)(YX2)+B| 



|B-n^.(B)|ir-, 



29+1 



where the last step follows from Lemma 59. We conclude that 

l|A-n|,,(A)||2< ||B-n|,,(B)|||^. 

Y is generated using a random R, so taking expectations and applying Holder's 
inequality, we get 

E[||A-n|,,(A)||2] < (E[||B-n|,,(B)||2])^. (A.l) 

We now focus on bounding the term on the right-hand side of the above equation. 
Let the SVD of B be B = UbSbV^, with the top rank k factors from the SVD of B 
being Ub./c, ^B,k, and Vb,a: and the corresponding trailing factors being Ub,t, ^b,t 
and Vb,t- Let pb be the rank of B. Let 

= V^^fcR e M^^" and ^2 = V^_,R G M(''B-fc)xr_ 

The Gaussian distribution is rotationally invariant, so Qi, Q2 are also standard 
Gaussian matrices which are stochastically independent because Vb can be extended 
to a full rotation. Thus, Vb^^R and Vb^R also have entries that are i.i.d. A/'(0, 1) 
variables. We now apply Lemma 6 to reconstructing B, with ^ = 2 and W = R. 
The rank requirement in Lemma 6 is satisfied because, from Lemma 56, the rank of 
Qi is equal to k (as it is a standard normal matrix), and thus the matrix R satisfies 
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the rank assumptions of Lemma 6. We get that, with probabihty 1, 



|B - ,(B)||^ < ||B - BkWi + ||(B - Bfc)R(V^' ,R)+||^ < ||Sb,.||^ + \\T.^^r^2^t\\l 



Using ^J + y"^ < x+y, we conclude that ||B — Ey fc(B)||2 < ||SB,r||2 + ||SB,T^2^^ lb- 
We now need to take the expectation with respect to Qi,Q2- We first take the 
expectation with respect to Q2, conditioning on Qi. We then take the expectation 
w.r.t. Qi. Since only the second term is stochastic, using Lemma 52, we have: 



[II^B,T^2^l'||2|^l] < ||Sb,t||2||^^i'||f + ||Sb,t||f||^i'||2- 

We now take the expectation with respect to Qi. To bound the term E [||f2]^||2] , we 
use Lemma 53. To bound the term E [Uri^^lli?] , we first use Holder's inequality to 

1/2 

bound E [Wnth] < E [IIO+III] ^ , and then we use Lemma 55. Since ||Sb,t||f < 
■\/min(m, n) — /c||EB,r||25 collecting our results together, we obtain: 



E[||B-n^,,(B)||2] < 1 



p — 1 



p 



\/ min(m, n) — k ) ||B — Bfc||2. 



To conclude, combine with eqn.( A.l) and note that ||B — Bfc||2 = || A — Afc||2''''~^. ■ 

We now have all the necessary ingredients to prove Lemma 7. Let Y be the 
matrix of Lemma 60. Set p = k and 



log (1 + + ^ Vmin{m, n} - k 

21og(l + e/V2) - 1/2 



so that 



Then, 



2q+l 



l + h ^^\/min{m, n} — A; I <1 1 ^■ 

p-1 p I V2 



E[||A-n2.^,(A)||2] < (1 + ^) IIA-A.I 



(A.2) 
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Given Y, let Q be an orthonormal basis for its column space. Then, using the algo- 
rithm of Section 2.1 and applying Lemma 4 we can construct the matrix Q (Q"'"A)^ 
such that 

||A - Q (Q^A) JI2 < v^llA - n^,,(A)|h. 

Clearly, Q (Q"^A)^ is a rank k matrix; let Z G R"^'^ denote the matrix containing 
the right singular vectors of Q (Q"^A)^, so Q (Q"^A)^ = XZ""". Note that Z is equal 
to the right singular vectors of the matrix (Q"'^A)^_ (because Q has orthonormal 
columns), and so Z has already been computed at the second step of the algorithm 
of Section 2.1. Since E = A - AZZ^ and || A - AZZ^Ha < || A - XZ^Hs for any X, 
we have 

IIEII2 < ||A - Q (Q^A)^, II2 < v^llA - Ul,{A)y. 

Note that by construction, EZ = Om.xfc- The running time follows by adding the 
running time of the algorithm at the beginning of this section and the running time 
of the algorithm of Lemma 2.1. 

A. 4 Proof of Lemma 8 

Consider the following algorithm, described in Theorem 10.5 of [73]. The 
algorithm takes as inputs a matrix A G M™^" of rank p, an integer 2 < k < p, and 
an integer p >2. Set r = k + p and construct the matrix Y G R"^^*" as follows: 

1. Generate an n x r standard Gaussian matrix R whose entries are i.i.d. A/'(0, 1). 

2. Return Y = AR G R'"''^ 

The running time of the above algorithm is 0{mnr). Theorem 10.5 in [73] presents 
the following bound: 

E [II A - YY+A||^] < (^1 + ' II A - A,||^. 

The above result is not immediately applicable to the construction of a factorization 
of the form A = BZ^ + E fas in Lemma 8) because Y contains r > k columns. 
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Lemma 61 (Extension of Theorem 10.5 of [73]). Let A be a matrix in M™^" of 
rank p, let k be an integer satisfying 2 < k < p, and let r = k + p for some integer 
p > 2. Let R G M"^^ be a standard Gaussian matrix (i.e., a matrix whose entries 
are drawn in i.i.d. trials from A/'(0, 1)^ and compute Y = AR. Then, 

E[\\A- Ul,{A)\\l] < (^1 + ||A - A,|||. 

Proof. We construct the matrix Y as described in the beginning of this section. Let 
the rank of A be p and let A = UaSaVJ be the SVD of A. Define 

= VJR G M'^^" and ^2 = Vj.^^R G M(^-*^)><^ 

The Gaussian distribution is rotationally invariant, so Qi, Q2 are also standard 
Gaussian matrices which are stochastically independent because V""" can be extended 
to a full rotation. Thus, V^R and Vj.^R also have entries that are i.i.d. Af{0, 1) 
variables. We now apply Lemma 6 to reconstructing A, with C, = F and W = R. 
Recall that from Lemma 56, the rank of Qi is equal to k with probability 1, and 
thus the matrix R satisfies the rank assumptions of Lemma 6. We have that, with 
probability 1, 



|A - n^fc(A)||^ < ||A - Afcll^ + WEp^k^tWl 



2 



where A — A^ = Up_/tSp_feVj„^.. To conclude, we take the expectation on both 
sides, and since only the second term on the right hand side is stochastic, we bound 
as follows: 

E[||Sp„,fi2f^+|||] En,[En,[\\J:,^kn2nt\\l\n^]] 
= Ej^i [ll^p-fcllFll^i'llF] 

p — 1 

(a) follows from the law of iterated expectation; (6) follows from Lemma 54; (c) 
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follows because is a constant; (d) follows from Lemma 55. We conclude 

the proof by noting that HSp^fcHi? = ||A — Afc||ir. ■ 

We now have all the necessary ingredients to conclude the proof of Lemma 8. Let Y 
be the matrix of Lemma 61, and let Q be an orthonormal basis for its column space. 
Then, using the algorithm of Section 2.1 and applying Lemma 4 we can construct 
the matrix Q (Q^A)^ such that 

l|A-Q(Q^A)J|^=||A-n^,,(A)|||. 

Clearly, Q (Q"^A)^ is a rank k matrix; let Z G M.""^^ be the matrix containing the 
right singular vectors of Q (Q-'^A)^, so Q (Q"^A)^ = XZ^. Note that Z is equal 
to the right singular vectors of the matrix (Q^'^A)^ (because Q has orthonormal 
columns), and thus Z has already been computed at the second step of the algorithm 
of Section 2.1. Since E = A — AZZ^ and 



|A- AZZ^II^ < ||A-XZ^|L 



for any X, we have 



|E||| < ||A - Q (Q^A) Jl^ = ||A - n|,,(A)|||. 



To conclude, take expectations on both sides, use Lemma 61 to bound the term 
E [||A-n|^;,(A)|||], and set 

P - 



e 



to obtain: 



E[||E|||]<(l + ^) IIA-A 



By construction, EZ = Omxfc- The running time follows by adding the running 
time of the algorithm at the beginning of this section and the running time of the 
algorithm of Section 2.1. 
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A. 5 Proof of Lemma 14 

Let 

X = \\Yns\\l 

be a random variable with nonnegative values. Assume that the following equation 
is true: 



E 0|Yf]S|||] = ||Y| 



Applying Markov's inequality to this equation gives the bound in the lemma. All 
that it remains to prove now is the above assumption. Let 

X = YQS e 

and for t = 1, ...,r, let X*-*-* denotes the t-th column of X = Yr2S. We manipulate 
the term E [||YfiS|||,] as follows: 



E OlYfiSIII] 



(a) 



(c) 



E 



t=i 



t=i 

r n 



-W||2 



{£) 1 
r 



t=i j=i 

r 



t=l 

I ^ \\f 



(a) follows by the definition of the Frobenius norm of X. (6) follows by the linearity 
of expectation, (c) follows by our construction of fi, S. (d) follows by the definition 
of the Frobenius norm of Y. 
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A. 6 Proof of Lemma 15 

Proof. Let = [X, 0„_ 

mxn]) SO that Y G M"^" is a square matrix whose first m 
rows are X""^ and whose last n — m rows are zero. Run Algorithm 4 of [66] with 
inputs Y and k. The bounds for the singular values are exactly those of Theorem 
3.2 of [66], which hold because the singular values of Y and X are identical. The 
only concern is the running time. The run time directly from [66] is 0(n^A; logj n), 
which treats Y as a full n x n matrix; but we argue below that the algorithm can 
be implemented more efficiently in 0{mnk log ju) time because we can effectively 
ignore the padding with zeros in all computations. First, we observe that the QR 
factorization of YII for any permutation 11 can always be written 



YR = QR 



On— mxm ^n— mxn—m 



On— mxfc On— mxn— fc 



where Qx € M."'''"' and 

On— mxn— fc 

QR factorization on the upper unpadded part, so: 



; effectively we only need to perform a 



x^n = Q 



X 



Om— fcxfc Cfc 



This has an important consequence to Algorithm 4 in [66]. In what follows we 
assume the reader is familiar with the notation in [66]. There are two basic steps 
in the algorithm. The first is to compute a function such as p(R, k) to determine 
which two columns to permute. This function depends only on 7*(Cfe) = 7*(Cfc) 
by construction, because the padded zeros contribute nothing to these norms. The 
second step is to refactorize and obtain the new Q and R, or more specifically to 
update a;*(Afc), 7*(Cfc) and A^^B^. As observed above, we just need to refactorize 
X^n, which means we simply need to run the the efficient update steps in [66] for 
Afc, Bfc and C^. We are effectively running the algorithm on X, ignoring the padding 
completely. So, from [66] the run time is 0{mnklogjn), as claimed. ■ 
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A. 7 Proof of Lemma 19 

We first restate the Lemma in a slightly different notation (actually this is the 
notation used in [10] so we chose to be consistent with existing literature). 

Lemma 62 (Dual Set Spectral Sparsification.). Let V = {vi,...,v„} and U = 
{ui, . . . , u,„} be two equal cardinality decompositions of the identity, where Vj G M*^ 
(k < n), VLi E ^ n), and: 'Y17=i^i'^J = Ifc one? ^11=1^1^ ~ ^i- Given an 

integer r with k < r < n, there exists a set of weights Si > (i = 1, . . . ,n) at most 
r of which are non-zero, such that 




and Ai s^UjuJ^ ~ ^ ^ 



The weights Si can be computed deterministically in O {rn [k"^ + time. 

In matrix notation, let U and V be the matrices whose rows are the vectors Uj 
and Vj respectively. We can now construct the sampling and rescaling matrices 
VL G R"^*", S G W^^ as follows: for i = 1, . . . ,n, if Sj is non-zero then include e, 
as a column of Q and y/sl as the i-th diagonal element of S; here is the i-th 
standard basis vector^. Using this matrix notation, the above lemma guarantees 
that (Tfc (V^fiS) > 1 - and ai (U^l^S) < 1 + y^. Clearly, and S may 
be viewed matrices that sample and rescale r rows of U and V (columns of U""" 
and V"""), namely the rows that correspond to non-zero weights Sj. Next, we prove 
Lemma 62. 

Proof of Lemma 62 

Lemma 62 generalizes Theorem 3.1 in [10]. Indeed, setting V = U reproduces 
the spectral sparsification result of Theorem 3.1 in [10]. We will provide a construc- 
tive proof of the lemma and we start by describing the algorithm that computes the 
weights Sj, -i = 1, . . . , n. 



^Notc that wc slightly abused notation: indeed, the number of columns of is less than or 
equal to r, since at most r of the weights are non-zero. Here, we use r to also denote the actual 
number of non-zero weights, which is equal to the number of columns of the matrix fJ. 
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Input: 




• V = {vi, . . . , v,„}, with Ya=i ^i^I = Ifc {k < n) 




• W = {ui, . . . , u„}, with YTi=i ^i^l = h{^< n) 




• integer r, with k < r < n 




Output: A vector of weights s = [si, . . . ,Sn], with > and at most 
r non-zero Sj's. 


1. Initiahze Sq = 0„xi, Aq = Okxk, Bq = Oixi- 




z. ror T — u, r — i 




• Compute Lt- and from eqn. (A. 6). 

• Kind an index 7 in -fl ti\ surh that 




?7(uj,5u,B^,u^) < L(vj,5l, A^,L^). 


(A.3) 


• Let 




2 


(A.4) 


• Update the jth component of s, A^ and B^: 




Sr+ib] = Sr[j]+t, A^+i = A^+tVjvJ, aud B^+i = B^- 


(A.5) 


3. Return s = r^^ ^1 — -y/fc/r^ ■ s^. 





Algorithm 12: Deterministic Dual Set Spectral Sparsification. 



The Algorithm. The fundamental idea underlying Algorithm 12 is the greedy 
selection of vectors that satisfy a number of desired properties in each step. These 
properties will eventually imply the eigenvalue bounds of Lemma 62. We start by 
defining several quantities that will be used in the description of the algorithm and 
its proof. First, fix two constants: 
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1 

5l = 1; = — 



1 



Given k, i, and r (all inputs of Algorithm 12), and a parameter r = 0, . . . ,r — 1, 
define two parameters L^- and U,- as follows: 

W = r I - - y - 1 = T-Vrk;\Jr = ^ ^ ^ ^ ^ = 5^ ( r + Vir 



(A.6) 

We next define the lower and upper functions ^(l, A) (l G M and A G M'^^'^) and 
0(U, B) (u G M and B G M^'^O as follows: 

Let L(v, 5l, a, l) be a function with four inputs (a vector v G R'^'^^, 5l G I^, a matrix 
A G R'''"', and L G M): 

^' ^) = Irlt^^Al'^irfll - v-(A - (L + 50i.)-^v. (A.8) 

^(L + ^L, A) -^(L, A) 

Similarly, let U (v, 6^, B, u) be a function with four inputs (a vector u G M.^^^, 6ij G M, 
a matrix B G M^''^ and U G M): 

C/(u. i„, B, u) = "^yj'f'Tu, + <5»)I. - B)-'u. (A.9) 

0(u,B) -0(u + 5u,B) 

Algorithm 12 runs in r steps. The vector of weights Sq is initialized to the all- 
zero vector. At each step r = 0, ... ,r — 1, the algorithm selects a pair of vectors 
(uj,Vj) that satisfy eqn. (A. 3), computes the associated weight t from eqn. (A. 4), 
and updates two matrices and the vector of weights as specified in eqn. (A. 5). 

Running time. The algorithm runs in r iterations. In each iteration, we evaluate 
the functions t/(u, 5u, B, u) and L(v,5l5A,l) at most n times. Note that all n 
evaluations for both functions need at most 0{k^ + nk"^ + + ni"^) time, because the 
matrix inversions can be performed once for all n evaluations. Finally, the updating 
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step needs an additional 0(/c^ + time. Overall, the complexity of the algorithm 
is of the order 0{r{k^ + nk"^ + £^ + nf + k"^ + f)) = O {rn {k"^ + i^)). 

Note that when U is the standard basis {U = {ei, . . . ,e„} and i = n), the 
computations can be done much more efficiently: the eigenvalues of Bg need not be 
computed explicitly (the expensive step), since they are available by inspection, be- 
ing equal to the weights s,-. In the function U{u, 6i,, B, u), the functions (given the 
eigenvalues) need only be computed once per iteration, in 0{n) time; the remain- 
ing terms can be computed in 0(1) because (for example) e^((u + 6i,)I — B)^^ej = 
(u + 5u — s[z])~^. The run time, in this case, drops to 0{rnk'^), since all the operations 
on U only contribute 0{rn). 

Proof of Correctness. We prove that the output of Algorithm 12 satisfies Lemma 
62. Our proof is similar to the proof of Theorem 3.1 [10]. The main difference is 
that we need to accommodate two different sets of vectors. Let W G R"*^™- be a 
positive semi-definite matrix with eigendecomposition 

m 

W = ^A,(W)u,u7 

1=1 

and recall the functions ^(l, W), </)(u, W), L(v, W, l), and f/(v, 6i,, W, u) defined 
in eqns. (A. 7), (A. 8), and (A. 9). We now quote two lemmas proven in [10] using 
the Sherman-Morrison- Woodbury identity; these lemmas allow one to control the 
smallest and largest eigenvalues of W under a rank-one perturbation. 

Lemma 63. F%x 6^>0,We M"^^"", v e M™, and L < A^(W). If t > satisfies 

<L(v,5„W,L) 

then Am(W + tvv'^) > L + 5l- 

Lemma 64. Fix 5^ > 0, W G i?"'^"', v e EJ^ , and U > Ai(W). If t satisfies 

t-' > f/(v,5u,W,u), 

then Ai(W + tvv^) < U + 5u- 
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Now recall that Algorithm 12 runs in r steps. Initially, all n weights are set 
to zero. Assume that at the r-th step (r = 0, . . . , r — 1) the vector of weights 
s,- = [St-[1], . . . , s^[n]] has been constructed and let 

n n 

A^ = s^[i]vjV^ and = ST-[z]ujU^. 

i=l i=l 

Note that both matrices A,- and B,- are positive semi-definite. We claim the following 
lemma which guarantees that the algorithm is well-defined. 

Lemma 65. At the r-th step, for all t = 0, ... ,r — 1, there exists an index j in 
{1, . . . ,n} such that setting the weight t > as in eqn. (A. 4) satisfies 

t/(u,-,5u,B,,U,) <r^ <L(vj,5l,A,,W). (A.IO) 

Proof of Lemma 65. In order to prove Lemma 65 we will use the following 
averaging argument. 

Lemma 66. At any step r = 0, . . . , r — 1, 

" fk 
^t/(ui,(5u,B^,U^) < 1 - -J - < ^L(vi,(5L,B^, w). 

1=1 i=l 

Proof. For notational convenience, let (j)^ = <^(Ur,BT-) and let (J)^ = ^(l^,A^). At 
r = 0, Bq = and Aq = and thus 0g = i/Vo and = —k/hQ. Focus on the 
r-th step and assume that the algorithm has run correctly up to that point. Then, 
4>T ^ 00 — ^0- Both are true at r = and, assuming that the algorithm has 

run correctly until the r-th step. Lemmas 63 and 64 guarantee that (p^ and are 
non-increasing. 

First, consider the upper bound on U. In the following derivation, Aj denotes 



127 



the i-th eigenvalue of B^-. Using Tr(u'^Xu) = Tr(Xuu'^) and Ujuf = I^, we get 

1 e 

Ell (u.+,-A.)(o.-A.) ^=l - 



^ (u^ - Ai)(u^+i - Xi) 
< ^ + 00- 

The last line follows because the last two terms are negative (using the fact that 
Ut-+i > > Xi) and 0^ < 0o- Now, using 0o = SuVri and the definition of 6^ the 
upper bound follows: 

In order to prove the lower bound on L we use a similar argument. Let Xi denote 
the z-th eigenvalue of A^. Then, 



i=l 

V^fc 1 k 



y L 

a. - T 



1 

5^ 



> --4 + ^- 



Assuming £ > the claim follows immediately because 5l = 1 and = —k/hQ = 
k/\/rk = \/kJr. Thus, we only need to show that S > 0. From the Cauchy-Schwarz 
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inequality, for ai.hi > 0, {Y^iaihif < {Y.if^'ih) (T^ih) and thus 



— ^ (Aj — LT-+i)(Aj — L,-) 5l "~ (Ai — L^+i)2(Aj — L^) Vfbi^ ("^^ ~ LT-)(Aj — Lt-_|_i) 



> 



1 A 1 



'''|J(A.-w„P(A.-W)|Ja,-l, ''"^V (Ai - L„i)2(A. - l; 

To conclude our proof, first note that 5^^ — > — ^l^o ~ ^ ~ \/k/r > 
(recall that r > k). Second, Aj > L^+i because 

11/7 

Amin(A^) >Lr + — >W + — = Lr+, y>W + l= W+l. 

Combining these two observations with the later equation, we conclude that S > 0. 



Lemma 65 now follows from Lemma 66 because the two inequalities must hold 
simultaneously for at least one index j. Once an index j and a weight t > have 
been computed. Algorithm 12 updates the j-th weight in the vector of weights to 
create the vector of weights 8,-+!. Clearly, at each of the r steps, only one element 
of the vector of weights is updated. Since Sq is initialized to the all-zeros vector, 
after all r steps are completed, at most r weights are non-zero. The following lemma 
argues that Amin(Ar) and Amax(Br) are bounded. 

Lemma 67. At the r-th step, for all t = 0, . . . , r — 1, Aniin(AT-) > and Amax(BT-) < 

Proof. Recall eqn. (A. 6) and observe that Lq = —\frk < and Uq = Sij\/ri > 0. 
Thus, the lemma holds at r = 0. It is also easy to verify that L^+i = Lt- + 5l, and, 
similarly, Ut-_|_i = u^- -|- 6^. Now, at the r-step, given an index j and a corresponding 
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weight t > satisfying eqn. (A. 10), Lemmas 63 and 64 imply that 

Amm(Ar+l) = Amin(Ar + tVjvJ) > W + 5l = W+l] 
Amm(BT-+l) = Ainax(BT- + tUjuJ) < Vr + = Ur+l- 

The lemma now follows by simple induction on r. ■ 

We are now ready to conclude the proof of Lemma 62. By Lemma 67, at the r-th 
step, 

Amax(Br.) < Uj. and Ajnm(Ar.) > L,.. 

Recall the definitions of U,. and from eqn. (A. 6) and note that they are both 
positive and well-defined because r > k. Lemma 62 now follows after rescaling the 
vector of weights s by r^^ ^1 — \/kJr^ ■ Note that the rescaling does not change the 
number of non-zero elements of s, but does rescale all the eigenvalues of A^ and B^. 



A. 8 Proof of Lemma 20 

Again, we first state Lemma 20 in a slightly different notation. 

Lemma 68 (Dual Set Spectral-Frobenius Sparsification.). Let V = {vi, . . . , v„} be 
a decomposition of the identity, where Vj G M'^ (k < n) and 'Yll=i^i^J — ^k', 
A = {ai,...,a„} be an arbitrary set of vectors, where a^ G M^. Then, given an 
integer r such that k < r < n, there exists a set of weights Sj > (i = 1 . . .n), at 
most r of which are non-zero, such that 

The weights Si can be computed deterministically in O {rnk"^ + n£) time. 

In matrix notation (here A denotes the matrix whose columns are the vectors aj), 
the above lemma argues that ak (V^nS) > 1 - and ||A1]S||f < ||A||i.. fi, S 
may be viewed as matrices that sample and rescale r columns of A and V""", namely 
the columns that correspond to non-zero weights Sj. Next we prove Lemma 68. 
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Proof of Lemma 68 

In this section we will provide a constructive proof of Lemma 68. Our proof 
closely follows the proof of Lemma 62, so we will only highlight the differences. 
We first discuss modifications to Algorithm 12. First of all, the new inputs are 
V = {vi, . . . , v„} and A = {ai, . . . , a„}. The output is a set of n non-negative 
weights Si, at most r of which are non-zero. We define the parameters 



1 



(5l = 1; (5u = ^ " A W = r - Vrk; = t6jj, 



for all r = 0, . . . , r — 1. Let denote the vector of weights at the r-th step of 
Algorithm 12 and initialize Sq and Aq as in Algorithm 12 (Bq will not be necessary). 
We now define the function Up (a, 6u), where a G and 61, G M: 

f/p (a, 5u) = 5-^a'^a. (A.12) 

Then, at the r-th step, the algorithm will pick an index j and compute a weight 
t > such that 

Upisij, 5u) < < Li^j, 5l, a,, w). (A.13) 
The algorithm updates the vector of weights s,- and the matrix 

n 

A^ = ^ Sr,i^ri\J. 
i=l 

It is worth noting that the algorithm does not need to update the matrix 

n 



T 



because the function Up does not need B,- as input. To prove the correctness of the 
algorithm we need the following two intermediate lemmas. 

Lemma 69. At every step r = 0, . . . , r — 1 there exists an index j in {1, ... ,n} that 
satisfies eqn. (A.13). 

Proof. The proof is very similar to the proof of Lemma 65 (via Lemma 66) so we only 
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sketch the differences. First, note that the dynamics of L have not been changed 
and thus the lower bound for the average of L(vj, 5l, A,-, L,-) still holds. We only 
need to upper bound the average of Upisii, 6u) as in Lemma 66. Indeed, 

n n n 

^f/i7(ai,(5u) = S^^'^ajai = ^u^^ ||a,||2 = 1 - 

i=l i=l 1=1 

where the last equality follows from the definition of 6^. ■ 

Lemma 70. Let W G M^^^ be a symmetric positive semi-definite matrix, let a G 
be a vector, and /ei U G M satisfy U > Tr(W). Ift>0 satisfies 

f/F(a,(5„) <r\ 

then Tr (W + tvv^) < U + 5u. 

Proof. Using the conditions of the lemma and the definition of Up from eqn. (A. 12), 

Tr(W + taa^) - U - 5„, = Tr(W) - U + ta^a - (5^, 

< Tr(W) - U < 0, 

which concludes the proof of the lemma. ■ 
We can now combine Lemmas 63 and 70 to prove that at all steps r = 0, . . . , r — 1, 

Amin(A^) > Lr and Tr(B^) < u^. 

Note that after all r steps of the algorithm are completed, = r ^1 — \Jk/r^ and 

= r {l — y^k/r^ Xir=i II ^« II 2- A simple rescaling now concludes the proof. The 
running time of the (modified) Algorithm 12 is O {nrk^ + ni), where the latter term 
emerges from the need to compute the function Upisij, 5u) for all j = 1, . . . , n once 
throughout the algorithm. 
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A. 9 Generalizations of Lemmas 19 and 20 

Lemma 71. Let X G M."-^^ and Y G R"^^ with respective ranks px, Py! px < ^ < 
n. One can deterministically construct in 0{rn{p\ + p\)) time a sampling matrix 
Vt G M"^'" and a positive diagonal rescaling matrix S G R''^'' such that 

m^n^Yh < (i - ^|^) m^rh <^nd \\Y^n^, <(i + wy^w,. 

Proof. Let the SVD of X is X = Ux^xVi, with Ux G W'^'p^, Ex G W^^'p^, 
and Vx G M^^''^. Let the SVD of Y is Y = UySyV^, with Uy G R"^''^, Sy G 
M^YXPY^ and Vy G R^^^^, Let [Vt, S] = Barrier Sampling 1 1 (Ux, Uy,r). By Lemma 
19, (Tn,in(U|fiS) > (1 - y^px/r), which imphes ||(U^fiS)+||2 < (1 - y^/r) and 
rank(Uxf^S) = px- Also ||Uy1]S||2 < (1 + a/py/V) because CTmaxlUyfiS) < (1 + 
Vpy/V). Thus, 

||(X^fiS)+||2 = ||(VxExU^nS)+||2 = ||(UinS)+(VxSx)+||2 
< ll(U^^^S)+||2||(VxSx)+||2; 

||Y^(]S||2 = ||VySyU^fiS||2 < || VySy || 2 || U^l^S || 2- 

(The inequahties follow from submultiplicativity, and (a) uses Lemma 2.) To con- 
clude, observe that ||(VxSx)+||2 = ||(X^)+||2, and ||VySy||2 = ||Y^||2. ■ 

Lemma 72. Let X G R"^'^ and Y G R^^" with respective ranks px.py; let px < 
r <n. One can deterministically construct in 0{rnp\ + n€) time a sampling matrix 
Vt G M"^*" and a positive diagonal rescaling matrix S G R''^'' such that 



[x^fiS)+||2< (i-y^) IKx^)- 



and llYfiSllF < IIYIU. 



Proof. Set S] = BarrierSamplingIII{Ux,Y ,r) and follow the argument in 
Lemma 71. ■ 
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A. 10 Proof of Lemma 21 

We start with the following definition, which is Definition 1 in [122] (properly 
stated to fit our notation). 

Definition 73 (Johnson-Lindenstrauss Transform). Let A G M™^", parameters 
< 6,e < 1 and function f. A random matrix R G M"^*" with r = fi(^^p^/(5)) 
forms a Johnson-Lindenstrauss transform with parameters e,6,m or JLT{e,6,m) 
for short, if with probability at least 1 — 6 for all rows A(j) G M^^" of A: 

(l-e)||A(,)||^<||A(,)i?||^<(l + e)||A(,)||^. 

We continue with Theorem 1.1 of [1] (properly stated to fit our notation and 
after minor algebraic manipulations), which indicates that the (rescaled) sign matrix 
M of the lemma corresponds to a JLT[e, 6, m) transform as defined in Definition 73 
with/(5)=log(l). 

Theorem 74 (Achlioptas [1]). Let A G M'"^'" anc? < e < 1. Let R ^ be a 

rescaled random sign matrix with r = p log(|) logm. Then with probability 1 — 5: 

(l-e)||A(,)||^<||A(,)i?||^<(l + e)||A(,)||2. 

Statement 1. The first statement in our lemma proved in Corollary 11 of [122]. 
More specifically, this corollary indicates that if i? is a JLT{e, 6, m) with r = 
0(A;/e^ log(|)), then, the singular values of the subsampled V^i? are within rel- 
ative error accuracy from the singular values of with probability at least 1 — 6. 
The failure probability 0.01 in our lemma follows by assuming cq sufficiently large. 

Statement 2. This matrix multiplication bound follows from Lemma 6 of [122]. 
The second claim of this lemma says that for X G M"^^" and Y G W^'', if i? G M"^'' 
is a matrix with i.i.d rows, each one containing four-wise independent zero-mean 
{1/y/r, —1/y/r} entries, then 

E [|[XY - XRR^YWl] < -|[X|[||[Y|[^. 
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The statement in our lemma follows because our rescaled sign matrix satisfies the 
four-wise independence assumption. Actually, our rescaled sign matrix has i.i.d 
rows with each one having r-wise independent zero-mean {1/ ^/r, — l/-y/r} entries. 
Assuming r > 4 the claim follows. 



Statement 3. These bounds appeared in Lemma 8 in [122]. Again, [122] assumes 
that the matrix has i.i.d rows, each one containing four-wise independent zero- mean 
{1/ y/r, entries. The statement in our lemma follows because our rescaled 

sign matrix satisfies the four-wise independence assumption. 



Statement 4. Let X = V^i? G M^^'" with SVD X = UxSxV^. Here, Ux G M^^^ 



and Ex G 



i>kxk 



and Vx e W""". Consider taking the SVD of {VI:R)+ and {VjRf: 



2, 



smce Vx and can be dropped without changing the spectral norm. 

Let Y = — Sx G M.''^^ be a diagonal matrix. Then, for all i = 1, k: 



1 - (X) 



Since Y is a diagonal matrix: 



lYlU = max lYjJ = max 

l<i<fc l<i<A: 



< max 

l<i<fc 



^.^(X) ^ 2e_ 



< 



2e 



< 3e. 



In the first inequality, |cri(X)| = cri(X), because the singular values are nonnegative 
numbers. Also, |1 - af{X)\ < 1 - crf{X}, because 1 - af{X) > 0, since af{X) < 1 
(from our choice of e and the left hand side of the bound for the singular values from 
the first statement of the Lemma). The second inequality follows by the bound for 
the singular values of X from the first statement of the Lemma. The last inequality 
follows by the assumption that < e < |. 
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Statement 5. Notice that there exists a sufficiently large constant Cq such that 
r > cok/e"^. Setting x = ||Xi?|||,, using the third statement of the Lemma, the fact 
that k > 1, and Chebyshev's inequality we get 

Prd. - E H I > .||Xf,l < < < ^ < 0.01. 

The last inequality follows by assuming cq sufficiently large. Finally, taking square 
root on both sides concludes the proof. 

Statement 6. Let E = A^- (Ai?)(Vfc i?)+Vj G M'"''". By setting A = Afe + A^.^ 
and using the triangle inequality for the Frobenius matrix norm: 

||E||^ < \\A,-AkR{VjR)+Vl\\F + \\Ap_kR{VlR)+Vl\\F. 

The first statement of the Lemma implies that rank(V^-R) = k thus (V^-R)(V^i?)"'" = 
Ifc. Replacing A^ = UfcE^Vj and setting (V^R)(VJR)~^ = Ik we get that 

\\Ak-AkR{VlRyvl\\F = \\Ak-\Jk^kylR(VlRVyI\\F = ||Afc-UfcSfcV^||^ = 0. 

To bound the second term above, we drop V^, add and subtract Ap_fci?(V^i?)"'", 
and use the triangle inequality and spectral submultiplicativity: 

||A,_fci?(Vji?)+V^||^ < \\A,^kRiy^kRf\\F+ ||A,_fci?((vri?)+-(Vji?)T)||^ 

< \\A,^kRR^Yk\\F + \\A,^kR\\F\\{ylRV -{VlRfh- 

Now, we will bound each term individually. A crucial observation for bounding 
the first term is that Ap.feV/t = Up^fcSp-fcVjL^Vfc = Omxfc by orthogonality of 
the columns of Vk and Vp-k- This term now can be bounded using the second 
statement of the Lemma with X = Ap_k and Y = V^. This statement, assuming 
Co sufficiently large, an application of Markov's inequality on the random variable 
X = II Ap_fci?i?"'"Vfc — Ap_fcVfc|||., and taking square root on both sides of the resulting 
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equation, give that w.p. 0.99, 



k\\F ^ 



< O.SellA 



p~k\\F- 



The second two terms can be bounded using the first statement of the Lemma and 
the third statement of the Lemma with X = Ap_fc. Hence by applying a union 
bound, we get that w.p. at least 0.97: 



The last inequality holds thanks to our choice of e G (0, 1/3). 
A. 11 Proof of Lemma 48 

Statement 1. Notice that this is a matrix-multiplication- type term involving the 
multiplication of the matrices E and Z. The sampling and rescaling matrices fi, S 
indicate the subsampling of the columns and rows of E and Z, respectively. Eqn. (4) 
of Lemma 4 of [43] gives a bound for such matrices fi, S constructed with randomized 
sampling with replacement and any set of probabilities pi,p2, ■■■,Pn (over the columns 



Notice that EZ = OmxA.- by construction. Now, replace the values of pi = — 
Definition 12) and rearrange: 



E||^ < ||A,_fci?i?TVfc||p+||A,_,,i?||^||(V^i?)+-(V^i?)T 



< 0.5e||Ap_fc||f + /(rT^||Ap_fc||i.-3e 

< 0.5e||Ap_fe||F + 3.5e||Ap_fe||i7 
= Ae-\\Ap^k\\F- 



of E): 




E[||EfiSS^fi^Z|||] < -||E 



Finally, apply Markov's inequality to the random variable x 



Enss^n^zwj, to 



wrap up. 
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Statement 2. First, by Lemma 13 and our choice of r, w.p. 1 — 6: 

1 - e < a'^iV^nS) < 1 + e. 

Let X = Z^nS e M'^^^- with SVD: X = UxSxV^. Here, Ux G and Ex G 

and Vx G M^'^^ By taking the SVD of X+ and X^: 

||(ZTfiS)+ - {Z^nSfh = llVxS^^Ui - VxSxU^^lh = ||Vx(Sx' - Sx)Ui||2 = 

since Vx and Ux can be dropped without changing the spectral norm. Let Y = 
- Sx G M^'''^' be diagonal; Then, for alH = 1, A;: 

_ W(X) 

a,(x) • 

Since Y is a diagonal matrix: 



lYlU = max lYjjl = max 

l<j<fc l<i<fe 



1 - (X) 



a.(X) 



|l-a2(X)l e 
max , , < 



i<i<k cri(X) VI - e' 



The inequality follows by using the bounds for crf{yi) from above. The failure 
probability is 6 because the bounds for (rf (X) fail with this probability. 

Statement 3. First, notice that from the first statement of the lemma (jfc(Z^ilS) > 
w.p. 1 — 5. This implies that Z'^^IS is a rank k matrix w.p 1 — 5, so, with the 
same probability: 

Now notice that: 



BZ^ - BZ^nS(Z^f^S)+Z^ = BZ^ - BlfcZ^ = 



T _ . . 

' 'mxn- 



138 



Next, we manipulate the term ||BZ'^-AfiS(Z^fiS)+Z'^||ir as follows (A = BZ'^+E): 
||BZ^- AfiS(Z^QS)+Z^||F = II BZ^ - BZ^l]S(Z^l]S)+Z'^-Ef]S(Z^fiS)+Z^||i. 



Om X n 



= ||Ef]S(Z^l]S)+Z^||i.. 

For notational convenience, let X = (Z^fiS)^ — (Z^fiS)"'". Finally, we manipulate 
the term ||EfiS(Z^fiS) + Z^||i. as follows: 

||El]S(Z^f]S)+Z^||i. < ||Ef]S(Z^l]S)+||i. 

< ||El]S(Z^fiS)^||F+ ||EfiS||^||X||2 

Pk 1 e 

^ a/-:^||E|IpH — ^ Elp- 



< I a/^ + ^^= I IIEI 



~ V21n(4)v^ ^ v^v^r^J "^"^ 
< ^IIEII. 

The first inequality follows by spectral submultiplicativity and the fact that ||Z"'"||2 = 
1. The second inequality follows by the triangle inequality for matrix norms. In 
the third inequality, the bound for the term ||Ef2S(Z"'"f2S)"'"||i;' follows by the first 
statement of the lemma w.p. 1 — 5, HEfiSHi? is bounded using Lemma 14 w.p. 1—5, 
while we bound ||X||2 by the second statement of the lemma w.p. 1 — 5. So, by the 
union bound, the failure probability so far is 35. The rest of the argument follows 
by our choice of r, assuming A; > 1, e < |, and simple algebraic manipulations. 
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A. 12 A Lower Bound for Spectral Column-based Matrix 
Approximation 

Theorem 75. For any a > 0, any k > 1, and any r > k, there exists a matrix 
A e M"''" for which 

||A-CC+A||2 n + a'^ 
||A- Afe||2 - r + a2 

Here C is any matrix that consists of r columns of A. As a ^ 0, this implies a 
lower bound of n/r for the approximation ratio of the spectral norm column-based 
matrix reconstruction problem. 

Proof. We extend the lower bound in [34] to arbitrary r > k. Consider the matrix 

A = [ei + ae2, ei + aeg, . . . , ei + ae„+i] G R(^+^^'<^, 

where e^ G M"+^ are the standard basis vectors. Then, 

A'^A = 1^1^ + aHn, cri(A) = n + a'^, and cr,f (A) = for i > 1. 

Thus, for all k > 1, ||A — A/dH = a^. Intuitively, as a — 0, A is a rank-one matrix. 
Consider any r columns of A and note that, up to row permutations, all sets of 
r columns of A are equivalent. So, without loss of generality, let C consist of the 
first r columns of A. We now compute the optimal reconstruction of A from C as 
follows: let be the j-th column of A. In order to reconstruct a^, we minimize 
||aj — Cx||2 over all vectors x G M*^'. Note that if j < r then the reconstruction error 
is zero. For j > r, a^ = ei + aej^i and Cx = ei J2i=i + J2l=i ^i^i+i- Then, 

(r \ ^ ( ^ X^r 

^a;i - 1 |+a^Xiei+i-ej+i||2 = +a^^a;,^+l. 
i=\ ) 2=1 \i=\ ) i=l 

The above quadratic form in x is minimized when Xj = (r + a^) ^ for alH = 1, . . . , r. 
Let A = A — CC^A and let the j-th column of A be a^. Then, for j < r, a^ is an 
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all-zeros vector; for j > r, kj = ae^+i — X]i=i ^i+i- Thus 



A A 



Orxr Orx{n—r) 
0(n— r)xr ^ 



where Z = ^:^^l„_f,l^„^ + a^ln-r- This immediately implies that 



||A-CC+A||^ = ||A||^ = ||A A|h = l|Z| 



in — r)a^ n n + , 



r + a'^ 



n + a 
r + a'^ 



\\A-A,\\l. 



This concludes our proof. 



APPENDIX B 
VERY RECENT WORK ON MATRIX SAMPLING 

ALGORITHMS 

While writing the results of this thesis, we learned the existence of two independent 
concurrent works that study low-rank column-based matrix approximation in the 
Frobenius norm and coreset construction for constrained least-squares regression, 
respectively. The recent work on Column-based Matrix Reconstruction appeared as 
a technical report in [68]; while, the recent work on coreset construction for Linear 
Regression will appear soon in the Proceedings of the 43rd ACM Symposium on 
Theory of Computing (STOC). We comment on these results below. 

B.l Column-based Reconstruction in the Frobenius Norm 

Two exciting results for column-based matrix reconstruction appeared recently 
in [68]. More specifically. Theorem 1 in [68] describes an algorithm that on input a 
matrix A G R™^", a target rank k, and an oversampling parameter r > k, constructs 
C G M™-^'' deterministically such that: 

||A - CC+A||^ < \ ||A - A,\\l. 
r + 1 — k 

Moreover, this algorithm runs in 0(rnm^ log(m)). Notice that, for any e > 0, it 
suffices to select only r = k + ^ — 1 columns to get a relative error approximation. 
Theorem 2 in [68] presents a faster algorithm that runs in 0{rnm?) (assuming 
m <n) and constructs C G M™^'' randomly such that: 

E [||A - CC+A||^] < l|A - A4l. 

Both these results are based on volume sampling approaches (see Section 3.1.3). The 
work of [68] was motivated by applications of column selection to approximation 
algorithms for Quadratic Integer Programming [69]. Moreover, [68] proved a new 
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lower bound: 

||A-CC+A||| ^ ^ k 



> I + --0 I 



which imphes that the bounds from [68] are asymptotically optimal. Finally, we 
should note that the bounds from [68] hold for the residual error ||A — CC'''A||i;' 
and it is not obvious if they can be extended to the rank k matrix nQ^(A). Recall 
that for arbitrary r > k: 

||A-CC+A||p< ||A-n^,,(A)||^. 

So, an upper bound for the term ||A — CC^A||i? does not imply an upper bound 
for the term ||A - ng^^(A)||^. 

B.2 Coreset Construction for Linear Regression 

Feldman and Langberg in [52] proved that a (1 + e)-coreset with r = 0{n/e^) 
rows can be found w.p. 0.5 in 0(mn^ + log(m)/e^) time. Although not explicitly 
stated, it is easy to see that the coreset of [52] applies to arbitrary constrained 
regression. The upcoming full version of [52] will give all the details of this result. 
The novel technical ingredient in [52] is the randomized fast analog of Corollary 18 
of Section 3.1.6. [52] gives a coreset which is as good (in terms of coreset size) as 
ours in Theorem 42. The running time of the method of [52] is better than ours; this 
trades the failure probability introduced in [52]. Recall that our theorem constructs 
the coreset deterministically. 



